Structural and energetic heterogeneity in protein folding 

Steven S. Plotkin and Jose N. Onuchic 
Department of Physics, University of California, San Diego 
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A general theoretical framework is developed using free energy functional methods to un- 
derstand the effects of heterogeneity in the folding of a well-designed protein. Native energetic 
heterogeneity arising from non-uniformity in native stability, as well as entropic heterogeneity 
Q ,' intrinsic to the topology of the native structure are both investigated as to their impact on 

the folding free energy landscape and resulting folding mechanism. Given a minimally frus- 
^0 ■ trated protein, both structural and energetic heterogeneity lower the thermodynamic barrier 

to folding, and designing in sufficient heterogeneity can eliminate the barrier at the folding 
transition temperature. Sequences with different distributions of stability throughout the pro- 
tein and correspondingly different folding mechanisms may still be good folders to the same 
structure. This theoretical framework allows for a systematic study of the coupled effects of 
energetics and topology in protein folding, and provides interpretations and predictions for 
future experiments which may investigate these effects. 
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1. INTRODUCTION 



Theories of protein folding currently focus primarily on predicting properties of the folding mechanism given 
the native structure and/or energy function is known a priori. The most powerful approach to this end has 
been the energy landscape theory, used in one form or another in most descriptions of folding |l|-|15|. This 
approach takes advantage of the huge number of conformational states available to a protein by treating the 
£j ' energetics of those conformations statistically, just as the description of a phase transition from a liquid to a 

crystal is understood through the application of statistical mechanics to the numerous degrees of freedom in 
the problem. However in understanding the self-organization of proteins and biological systems in general, it 
is necessary to study properties particular to finite-sized systems, e.g. barrier heights and corresponding rates. 

fvq ■ For a finite system such as a protein, characteristic features present in the amino acid sequence give rise to 

residual signatures in thermodynamic and kinetic properties. For example, although the four-helix proteins 
Im7 and Im9 are structural homologs, Im9 folds by a two-state mechanism while Im7 folds through an en 
route intermediate |16] : the free energy landscape may fluctuate sequence to sequence for aa chains that fold 
to the same native structure. Other experiments also indicate that rates and/or intermediates may differ for 

^^ ■ structural homologs []17 , 18|. 



Since a knowledge of the native structure does not completely determine the free energy profile, we might 
ask what information does, and also what parameters must be known to predict other properties of the folding 
mechanism, such as the specificity or diffusivity of the folding nucleus, for example 119-02]. By analyzing 
the energetic statistics of ensembles of states, landscape theory provides a framework to distinguish folding 
processes common to an ensemble of sequences from those peculiar to individual sequences. A particular 
property, for example folding transition temperature T F , is not strongly dependent on the detailed Hamiltonian 
of the protein, but only on a few thermodynamic parameters such as overall native stability, chain stiffness, 
overall variance in energies (if for example the sequence were thread through random structures) , and perhaps 
also on chain length and overall hydrophobicity which induces generic collapse. The folding temperature 
may be expected to be a universal or self-averaging property for the ensemble of sequences having these 
parameters. Such quantities are important since they are amenable to analysis: the transition temperature 



5_i 

may be expressed quantifiably through the parameters mentioned above as 

(1.1) 




where e is the average native stability per residue (e < 0) , s is the entropy gain per residue in unfolding which 
depends on chain stiffness and also net hydrophobicity, b is the non-native energetic variance per residue, a nd z 



is the average number of neighbors per residue which is weakly a function of the chain length N. Equation (1.1) 



further gives the folding temperature for new sequences if these parameters can be determined, or conversely 



eq. (1.1) may help to determine intrinsic protein parameters from a measurement of the transition temperature. 



Equations for self-averaging properties provide a simple framework for understanding the phenomena; for 



example if the variance is weak compared to the stability (the protein is well-designed) then the folding 
temperature is to the first approximation proportional to the stability gap over the entropy of unfolding: 
T F « —E N /S . So for example a well-designed stiff protein has a higher transition temperature than a flexible 



one, since it has less conformations per residue in the unfolded state. Moreover, by eq. (1.1) T F is always below 
z|e|/s when non-native strength b = 0, so non-native interactions always lower the folding temperature by 
stabilizing the unfolded state. As b is increases, T F decreases but remains real so long as p3| 



the equality in (1.2) holding when T F reaches its minimum value of sj 'z/2s b which is precisely the glass 
temperature T G , defined as the temperature where the entropy of a random aa sequence is no longer extensive, 
i.e. where a random system becomes trapped in one of a small number of low energy states. Because T F is 
found by equating the total free energy of the folded and unfolded states, it should be only weakly sensitive to 
the details of the actual distribution of native stability within the protein. On the other hand, properties such 
as the folding barrier and its corresponding rate may depend strongly on the distribution of native stability 
throughout the protein. 

To theoretically treat the thermodynamics of folding and unfolding, we quantify a model protein in terms 
of the full native Hamiltonian {ei}, as well as the full distribution of native contact lengths {£i}, under the 
assumption that the protein under consideration is well-designed with T F reasonably larger than T G , i.e. the 
inequality (|1.2|) is easily satisfied. Native heterogeneity is retained explicitly, whil e non- nati ve interactions 



are treated through an average background field- the scalar quantity b in eq.s (1.1) and (|L2j). We are thus 
isolating the effects of native heterogeneity on the folding mechanism. 

Certainly if the entropy around the transition state were small, as when the inequality ([L2) is not well- 
satisfied, the position and height of the rate determining barrier(s) would fluctuate wildly sequence to sequence. 
However, since proteins h ave evolved native stabilities larger than their RMS non- native energy scale 23|~29| as 



manifested in inequality (1.2), the temperature T F where the native state is stable is sufficiently higher than T G , 
such that there is an extensive amount of residual entropy left under folding conditions. Nevertheless, we find 
that even though there is a large entropy present in the system, there are still in fact strong dependencies of the 
barrier and folding mechanism on the distribution of native stability and distribution of native contact lengths. 
In fact local fluctuations in native stability and structure do not average out, but contribute extensively in 
determining properties of the folding mechanism! 

For a property which is not self-averaging over a given ensemble of sequences, the parameters specified 
to determine the ensemble are either not sufficiently accurate or are incomplete. For example, the folding 
transition temperature T F is not self-averaging over the ensemble of sequences that fold to a particular native 
structure, since these different sequences may have different native stability, flexibility, etc. Nonetheless, a 
quantity such as folding temperature or folding barrier which fluctuates over an incompletely specified ensemble 
may have a mean that is still useful in characterizing trends as a function of one or more parameters. An 
example is the increase on average in foldin g ra te, or decrease in folding barrier, as mean native contact 



length £ (more specifically l/N) is decreased |30|, for which several models have b een proposed |3l|,|32|], and 



which we consider here within our theoretical framework as well (see section 5 5.G). The observed correlation 
between rates and £ implies that many proteins are sufficiently well-designed such that native topology plays 
an important if not dominant role in governing folding mechanism, a topic recently investigated by several 
authors ||Q. 

Our intention here is to go beyond the first moment of the contact length distribution £ or stability distri- 
bution e. We investigate how the full distributions of energetics and topology as well as correlations between 
them affects the free energy profile , corresponding barrier, folding rate, and overall folding mechanism. For one 



example, we find in section 5 5.H that there is a net correlation between structural variance defined through 
the second moment of the contact length distribution and folding rate for well-designed proteins with a given 
mean contact length £. Expanding on our previous work |43] , we find that native heterogeneity, both entropic 
and energetic, plays an important role in quantifying protein folding mechanisms. We show how one can 
extend the analysis of thermodynamic quantities by using functionals to describe folding properties which are 
not necessarily self-averaging but which may depend on distributions of coupling parameters. To this end 
we derive a simple field theory with a nonuniform order parameter to study fluctuations away from uniform 
ordering, through free energy functional methods introduced earlier by Wolynes and collaborators |34|, [l4|. fl5| . 
The theory is in good agreement with lattice simulations also performed in this paper. Similar effects have also 
been observed in Monte Carlo simulations of sequence evolution for lattice protein models, when the selection 
criteria involves fast folding rate p6| . Here we see how, from general considerations of the energy landscape 



theory, selecting for rate can induce heterogeneity in the transition state ensemble. The folding barrier for a 
well-designed protein is maximized when the nucleus is the most diffuse. For typical values of native energies, 
well-designed proteins have heterogeneous funneled folding mechanisms [p7|-pl|. 

Our results are also supported by several experiments in the literature as described in the conclusions 
section, and suggest experiments to be performed. For example the reduction of barrier height with folding 
heterogeneity should be experimentally testable by measuring the dependence of folding rate for a well-designed 
protein on the dispersion of ^-values |52j| . It is important that before and after the mutation(s) the protein 
remains fast-folding to the native structure without "off-pathway" intermediates, and that its native state 
stability remain approximately the same, perhaps by tuning environmental variables. 

In the arguments below we associate reductions in the free energy barrier AF$ to increases in the folding 
rate fc F . This is true as long as the prefactor k in the expression for the rate 

fc F = k e~ AF % I T (1.3) 

is more weakly affected than the barrier height under redistribution of native stability. While the distribution 
of native stability must indeed couple with the specific distribution of non-native interactions, for well-designed 
proteins with large transition-state entropy, it is more likely that the effect on the prefactor comes from the 
coupling of the transition temperature T F to the distribution of native stability, as long as the protein still folds 
to the same native structure. In other words we must consider the effect on the prefact or a s the ratio of the 



transition temperature to glass temperature T F /T G changes, i.e. as the left hand side of (1.2) changes and the 



protein becomes more or less well-designed. However we find (see fig.s [12 and 19) that even for perturbations 
in native energy large enough to kill the barrier, the folding temperature varies only weakly compared to 
the changes in barrier height, so that the ratio T F /T G should also not vary significantly compared to the 
changes in barrier height. Thus the barrier height is the strongest determinant of folding rate in well-designed 
proteins. For Go-like models with native interactions alone, the energy distribution does not strongly affect 
the prefactor: rates obtained from simulations are well fit with those predicted from the change in barrier 
height alone (see fig. |T§| ), i.e. the energy distribution does not strongly affect the reconfiguration kinetics 
appearing in the prefac tor, compared to its effect on the barrier height. For a full discussion of the above 
effects see section 



5 5.F 



When the Hamiltonians consists of pair interactions alone , re distributions of native stability can eliminate 
the barrier entirely at the folding temperature, see figures |ll| and |l7|. It is worth noting that many-body 
interactions which are believed to be present in real protein interactions p3|-p9[| tend to increase the barrier 
height |6C|-|62|| , and in their presence the barrier may be more robust to redispersement of native stability. 

A funnel folding mechanism consisting of a large number of routes to the native structure is preserved for 
a wide variety of folding scenarios and barrier heights, including folding through on-pathway intermediates. 
For the distributions of native energy necessary to induce folding through one or a few routes, the folding 
temperature drops by about a factor of six, which indicates that for realistic energy functions which are also 
composed of non-native interactions, folding would be exceedingly slow at the low temperatures where the 
native state would be stable. However for proteins which are large and multi-domain, it is possible that 
entropic or energetic heterogeneity may induce significantly route-like folding near the transition temperature. 

The paper is organized as follows. Firs t in section we outline the strategy of the calculation and illustrate 
it with a simple example in section p 2. A . Next we anticipate and explain several of the results with physical 



arguments in section pr its reading is not essential for the rest of the paper, but should be very helpful 
to the reader. In section the full expression for the free energy functional is derived, and the functional 
is implemented in section lq where comparison is made to the results from lattice simulations. Finally we 
conclude and suggest future research. 

2. THE GENERAL STRATEGY 

It is first necessary to characterize the generic properties of the native state. We adopt a coarse grained 
description for the native structure, and describe it by its distributions of native contact energies {e^} and 
native loop lengths or contact lengths {£i} (see figure [l]). Here ti is the solvent averaged effective energy of 
contact i, and ii is the sequence length pinched off by contact i (see fig. Ill) p3f. We use a single subscript 
for the labeling index i because we are only considering effects on the particular set of native contacts for a 
given native structure. Non-native interactions are treated by an average field, since the protein is assumed 
to be well-designed to its native structure, and native interactions arc then most important in determining 
the folding mechanism. The index i runs from 1 to M, where M is the number of residue pair contacts in the 



lowest energy native structure. M scales roughly extensively, i.e. M — zN, with N the number of residues 
in the polymer chain. Here z is the mean number of contacts per residue: a function of either the lattice 
coordination number or the off-lattice cut-off length. It is of order 1, with surface area corrections dying away 
as TV -1 / 3 pM. We can quantify nativeness in the first approximation by the fraction of native contacts Q, with 
< Q < 1. Other parameters are also reasonable for stratifying the landscape: the fraction of correct (native) 
dihedral angles p7|], coarse grained position in space in the native structure [ |S5| , [66| , or even the ensembles 
having a given probability to fold before unfolding fl67| . But Q is the most suitable for calculation in the 
present theory. At partial degrees of nativeness the probability to form contact i is defined as Qi(Q), and 
we define Q*(Q) as the fraction of time contact i is formed at equilibrium in the ensemble with MQ native 
contacts, or equivalently the fraction of proteins in a macroscopic sample with a given degree of nativeness 
having that contact formed. Non-uniformity in Q* values at partial degrees of nativeness would indicate that 
the protein prefers to fold some regions over others. 

Following the formalism used in inhomogeneous fluids pq,p9] and the theory of first order transitions j70| 
we write a free energy functional F({Qi} , {e^} , {£%}) to characterize the effects of structural and energetic 
heterogeneity superimposed on the overall folding funnel. This approach has been used earlier by Bohr and 
Wolynes to describe domain growth in proteins |71[] and more recently as a calculational tool for experimental 
0- values p^,|44|p[. 



The free energy functional is first interpreted as depending upon the local contact probabilities Qi(Q) = 
(Q(ri — rf )) T (Q) where i labels the native contact between two residues, r t the distance between them, 6 
is a function that measures proximity such as a step function for off-lattice models or a Kronecker delta for 
non-bonded nearest neighbor sites on-lattice, and (• • •) indicates the an average over the ensemble at Q. We 
will typically take (• • ■) to be a Boltzmann weighted average; then Q* is the thermally averaged fraction of 
the time two parts of the protein are in proximity (contact) p2[: 

QKQ) = (4> T - E W. c) exp( ^ c/T) (2.i) 

c£Q 

where 5(i, c) = 1 if contact i is made in configuration c, and S(i, c) = otherwise. The sum may be taken over 
any ensemble of theoretical interest. Here we have in mind the ensemble defined as having a given degree of 
overall order Q = (1/M) £\ Qi ©■ 

In the functional method, all the contact energies {e^} and loop lengths {£i} for a protein are initially 
assumed as given, and the thermal (most probable) distribution of contact probabilities {Q* ({e^} , {^} , Q)} 
is found by minimizing the free energy functional F({Qi (Q)} | {ti} , {£i}) with respect to the distribution 
of occupation probabilities, subject to the constraint that the average probability is Q, i.e. Yli Qi — MQ 
(Q then parameterizes the values of the Q^s). Examples of the functions Q*(Q) are plotted in figures || 
and |b|. This procedure is analogous to finding the most probable distribution of occupation numbers, and 
thus the thermodynamics, by maximizing the microcanonical entropy for a system of particles obeying a giy en 



occupation statistics. Here the effective particles (the contacts) obey Fermi-Dirac statistics, (see eq. ( 4.35 )), 
since no more than one bond can "occupy" a contact. The system can be understood to have a set of free 
energy levels obeying a distribution governed by the native structure and energies of the protein, and we seek 
the fraction of time (the probability) those levels are occupied given a fixed overall number of levels filled. 

The free energy for a system obeying the thermal (most probable) distribution {Q* (Q, {e^} , {4})} is then 
considered a function of the contact energies for a fixed native structure: F(Q, {ej} | {£i})- That is, we consider 
the folding free energy barrier as a functional of the interaction energies {e^} for a given native topology. 
The free energy depends on the energies {e^} both explicitly and implicitly through the thermal contact 
probabilities {Q* ({e^} \Q, {£i})}- Then we can seek the special distribution of contact energies {e*(£i)} that 
extremizes (minimizes or maximizes, depending on the second derivative) the thermodynamic folding barrier 
to a particular structure by finding the extremum of ^({e^} | {£i}) with respect to the contact energies £j, 
subject to the constraint of fixed total native energy: 53j ti = Me = E N , i.e. while maintaining the same overall 
stability of the native structure. Thus we are isolating the effect of heterogeneity on the folding mechanism. 
This distribution when substituted into the free energy gives in principle the extremum free energy barrier 
as a function of native structure F^{{£i}), which might then be optimized for the fastest/slowest folding 
structure and its corresponding barrier. However we found that in fact the only distribution of energies for 
which the free energy was an extremum is in fact the distribution which maximizes the barrier by tuning 
all the contact probabilities to the same value: Qi{Q^) = Q^ ■ In this case the coupling energies would 
be tuned to eliminate any information contained in the native structure, except for the mean loop length 
£ = {1/M) 53 . £{. Any perturbation away from this scenario lowers the free energy barrier. We also examine 



the effects of structural dispersion on the barrier, i.e. a free energy for variable loop distribution but fixed 
coupling energies F(Q, {£i} \ {£%}), and arrive at the same conclusion: for fixed energies, increasing structural 
variance (at fixed average loop length) lowers the barrier and thus speeds the rate, as long as the protein is 
sufficiently well-designed that the rate is governed by the free energy barrier. 



2. A. An example 



We illustrate the procedure by applying it to a more trivial system- an Ising paramagnet in a non-uniform 
external field. The Hamiltonian for this system is 



N 



H = - ^ 6i(Ti 



(2.2) 



i=i 



where ctj = ±1 is the ith spin and e^ is its local field energy. To obtain the free energy functional we need an 
expression for the entropy in terms of the spin degrees of freedom. If the field was uniform, the entropy per 
spin s(q) could be written in terms the fraction of up spins N + /N = q as 



*(«) = ^ = im 



AH 



N N (Nq)\[N(l-q)]\ 



= [-qlnq-(l-q)ln(l-q)} 



(2.3) 



Here q = (1 + a)/2 where a = (1/-/V) J^. Oi is the average magnetization per site. However if the field varies 
from site to site so will the equilibrium value of the spin. To allow for this the entropy per spin must be written 
in terms of qi — (1 + 0j)/2, and the total entropy is then a functional S({qi}). The free energy functional is 
then 
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(2.4) 



The equilibrium values of the spins a* are obtained by finding the extremum of the free energy, perhaps 
subject to a fixed overall magnetization j74| M — J^ cr,: 



— [F + h u y E* j =0. 



dcr 



(2.5) 



This leads to the equation 



at = tanh 



T 



(2.6) 



for the equilibrium values of the spins. Each spin follows its local field according to the well-known Brillion 
function (for spin 1/2). The Lagrange multiplier h M is determined from the sum ^ a* — M . For a uniform 
field e, h M = T tanh - (M/N) — e. The second variation of F is positive indicating the free energy is minimized 
and a* are the thermal equilibrium values: 
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(2.7) 



Substituting a*{ei) (eq. (2.6)) back into the free energy functional (2.4) gives the free energy in terms of the 
coupling energies {ti\: 



F({ej}) 
T 



N 



= -iVln2-^ln 
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This is equivalent to the form obtained from the partition function in the canonical ensemble. 

Now we can seek the set of fields e* that extremizes the free energy subject to a given total coupling energy 



^(^5>J=0. (2.9) 

This yields the condition that all the spins have the same value and thus that the field be uniform: 

a l (e*)=p (2.10) 

e* = e (2.11) 



However, second variation of the free energy gives 

6 2 F 



Stj Set 



{u}={e} 



, f sech 2 y T 



^■±sech 2 (^) (2.12) 



which is negative, indicating this choice of coupling energies maximizes the free energy. Thus any perturba- 
tions away from the uniform field will lower the free energy. Although the entropy functional is much more 
complicated for proteins, we find that there too the only free energy extremum is a maximum. 

3. PHYSICAL ARGUMENTS FOR THE EFFECTS 

Before we present the full free energy functional theory, we include here some physically motivated arguments 
for the effects to give the reader a deeper intuitive feel for the results derived later within the more general 
framework. This section is fairly independent of the rest of the text; it is broken up into subsections which 
may be skipped or read in any order. The first subsection concerns the effect on rates by changing the 
interaction energies of contacts that are likely or unlikely to begin with. The second subsection consists of 
various proofs describing how heterogeneity lowers the barrier and the consequences of this phenomenon, and 
in the third subsection we show a simple proof that a protein with stability distributed uniformly has maximal 
conformational entropy. 

3. A. Making early-forming native contacts relatively stronger will tend to speed folding more than 
making late-forming native contacts relatively stronger stronger by the same amount 

The argument proceeds as follows [JF5| . First notice that 

jr = 3r (- TkZ ) = ?t?^, (3.i) 

c 

where the sum is over all the states with a given similarity Q to the native. Since t he e nergy of conformation 



c is a sum of its contact energies: E c = Y] - Ec , e^, dE c / ' dti = 8{i, c) and thus by eq. (2.1) 



dF 

%- = Q* ■ (3.2) 



This result, derived below within the functional formalism (c.f. eq. ( |5.8| )), means that the change in free energy 
5F in perturbing a contact i an amount Set is equal to the amount of that perturbation times the fraction of 
time that contact is formed. An analogous equation holds for an inhomogeneous fluid, where the density of 
fluid at position x, n(x), plays the role of contact probability and the external field at x, u(x), plays the role 
of the perturbation: SF/Su(x) = n(x). 

Now imagine taking two contacts % = 1,2 within the protein having formation probabiliti es Q i,Q 2 , and 
making equal and opposite energetic perturbations on them 5e > (see fig. ||). Now by eq. (3^2), the total 
change in free energy to first order is 

5F = -Q 1 5e + Q 2 5e = -{Q 1 -Q 2 )5e (3.3) 

so if Qi > Q 2 the change in free energy is negative and if Q 2 > Qi, SF > 0. Since contacts are typically 
unformed or less formed in the unfolded state, we can say that if Qi(Q^) > Q 2 (Q' ! ), SAF^ < and vice- 
versa. Since for well-designed two-state folders the rate is controlled by the free energy barrier, the assertion 



is then demo nstrated (c.f. the discussion in the introduction regarding barrier governed rates, and also see 



section 5 5.F). Some obvious caveats include perturbations on a protein not well-designed, perturbations of 
contacts involving residues anomalously formed in the unfolded state, or situations where strengthening one 
of the contacts lowers the free energy of an on-pathway intermediate; for these exceptional cases the effect 
may not be observed. 

3.B. Adding native heterogeneity will always lower the thermodynamic folding barrier in a 

well-designed protein. 

A homogeneously ordering protein is equally likely to fold from anywhere within it. As perturbations are 
made away from this scenario, say in native interaction energies or through structural variance, the folding 
barrier tends to decrease. Several arguments detailed in the following subsections suggest this effect. 

3.B.I. Random Energy Model method 

Consider making random energetic perturbations on the contact energies of an initially homogeneous ideal- 
ized system (where all contact probabilities are the same: Qi — Q) with free energy barrier F HOMO and folding 
rate /c exp(— F HOMO /T). Then the new rate is 

k f = k exp ( HOM ° J = fc H oMo exp ( — J . (3.4) 

If the total native (unconstrained) energetic variance J^. Sef is A-EjJ, the variance at the transition state is 
approximately AE K * = Q*(l — Q$)AE^, given the energies must sum to total native energy E N . The variance 
vanishes at Q = since there are no contacts made there, and vanishes at Q — 1 since all the J^ ej = E N , i.e. 
all the energies must sum to a fixed number and thus their sum cannot vary. Approximating the transition 
state as an ensemble of states with uncorrelated energies, i.e. a random energy model [J76|, and considering 
only the effects of changing native interactions, the energy will always decrease twice as much as the entropy 
times the temperature under the influence of heterogeneity. Thus the free energy barrier decreases: 

«m - «cn - rsm = -^ - ^ = - °' (1 - 2 g' )A£ " , (3. 5 ) 



and the rate in eq. (3.4) increases as 



k f ~ fc HO Mo exp I — j . (3.6) 

This crud e argument yields essentially the same result as a much more detailed functional analysis, c.f. 



cq. ( 5.19 ) of the text; an additional factor appears there to account for polymer effects on the number of 
routes to the native state. By this argument even for an initial unperturbed funnel which is fully symmetric 
(an idealized case where all contacts are equally likely to be formed), introducing arbitrary heterogeneity 
lowers the folding barrier. 

As we will see below, since energies and entropies enter the expression for Qi on the same footing, the above 
statements apply to the dispersion in loop entropies inherent to a particular native structure, see fig. E2L 

3.B.2. Argument from Transition state theory 



The result (3.6) is not surprising from the point of view of transition state theory. Another way of describing 
it is to note that the time rate of change of the population J\f in a metastable state is proportional to the 
escape rate, and the escape rate is itself proportional to the ratio of partition functions, transition state to 
reactant: 

J\[ Z° y ' 



Now we imagine the transition state to be composed of an ensemble of &/ microstates of the system having 
energies Ef — E + SEi, where 5E{ is distributed state to state from a Gaussian distribution: P(SEi) ~ 
exp{~(SE t ) 2 /2Nb 2 ). Then 

Z * = E e ~ 0Ef ~ e ~^ n * ( e ~^ E ) = Z*e m2 / 2T2 . (3.8) 

i 

So, neglecting changes in the prefactor, the rate increases exponentially with the variance in the transition 
state, which scales extensively with system size. 

3.B.3. Optimum fluctuation method 

Applications of nucleation in disordered media U% [T8J to protein folding show similar trends in the folding 
rate. In a system such as a protein there may be regions where nucleation of the folded state is favored due to 
local energetic or entropic inhomogeneities. These may speed the rate of nucleation by decreasing the effective 
thermodynamic nucleation barrier, referred to in the nucleation literature as the optimum fluctuation (see 
fig. 0). In the theory of electronic band tails in disordered systems the optimum fluctuation method has been 
used to calculate the density of states at the mobility edge ]79|-|8l|] . 

Let the folding rate for a given nucleation barrier F$ be k(F$) — fc exp (— F*/T) and the probability 
distribution of nucleation barriers be defined as P(JF*) = cxp (— (f>(F 1f )^ . Then the average rate in the presence 
of a distribution of barriers is 

k = fdFP(F)k(F) « k e~ G{T) (3.9) 

where 

G{T) = ™ n U{F) + pj (3.10) 



The decrease in activation barrier by the REM argument above amounts to letting the free energy barriers be 
Gaussianly distributed about a mean F with variance A 2 , so 4>{F) as (F — F ) 2 /2A 2 . Then straightforwardly 
from ( |3.10D the effective barrier F*(T) = F t - A 2 /T and thus 




/•' '-= '■: l 'M> ! -— + 7^ ) = fc HOMo exp ( — j , (3.11) 



which has the same form as eq. ( p.q ), since this is essentially the same argument as above. 

We could have just as well found the disorder averaged time to nucleate a folded structure, from 
r = J P(F)t(F) — r exp(F jT + A 2 /2T 2 ). Large barriers increase the mean time above zero-disorder 
value when the averaging is done. This does not mean the rate observed in an experiment slows, since the 
important quantity g overning the decrease in reactant population is the transition rate k, as described above 
in section 



3 3.B3.B.2J . 



Example: Free energy potentials observed in simulations. Consider for example the free energy profile 
obtained from off-lattice Monte Carlo simulations of a uniform energy Go model to the native structure of 
CheY (see fig. gj). The profile is obtained by Boltzmann sampling the states and partitioning them to different 
ensem bles g iven their overall number of native contacts. This profile is proportional to the function G(T) above 
in eq. ( |3.10|) , i.e. it contains by construction the reduction in barrier due to structural (entropic) fluctuations, 
and so should be a good predictor of the rate. The probability a particular native core (illustrated in fig. ph 
is sampled is 

Pc » e -^/ T+SH = e~ F ^ T (3.12) 

where E c is the energy of the native core (E c — E N Q for all cores in the uniform Go model) and S H is the 
entropy of the polymer halo dressing the core. The role of native cores and halos in calculating free energy 



profiles was discussed in |61|. So up to a Q independent constant, the free energy at Q is obtained from the 
relative probabilities as 



;(Q)~-log 



E 



-F C /T 



log I dF c n{F c )e- Fc/T 



(3.13) 



where the sum has been replaced b y an inte gral o ver the number of cores at Q having free energy F c . 
Equation ( |3.13 ) is analogous to eq.s (3J3) and ( 3.10 ) above obtained from the optimum fluctuation method. 
As a limiting case consider an idealized folding funnel with no dispersion in energetics or entropies. Then 
n(F c ) = exp(S ROUT E )6(F c — F c ), where exp(S , ROU TE) is the number of cores, or routes through the bottleneck. 
The free energy in (3.13) then becomes 



(Q) 



E(Q) 
T 



Sn(Q) - S TI 



.(Q) 



(3.14) 



which reproduces the mean-field free energy profile [pTJ in the absence of non-native interactions. When there 
is dispersion in the free energies of a partially structured protein, there is a distribution of core free energies. 
Heuristically we can approximate this by a Gaussian distribution: 



n(F c ) 



^Sroute 



P(F C ) « exp S R 



F a 



F C r 



2A 2 



(3.15) 



Then 



F 



F 



dFoe -Fc/T-(F c -Fof/2^ 



^ROUTEi^; ) 



_ A^_ 
T 2T 2 



(3.16) 



So the barrier observed by sampling the Monte Carlo data or running molecular dynamics is the optimal 
fluctuation barrier, which includes in it the lowering effect due to structural dispersion in fig. (pft and both 
structural and energet ic disp ersion in general. The lowering of the barrier due to structural variance is further 
investigated in section 5 5.H| (see also fig. |22| ) . Additionally, experiments which monitor equilibrium properties 



related to relaxation rates or native structure formation meas ure th e optimum fluctuation, as mentioned above 
in the context of transition state theory. S HOUTE (Q) in eq. ( 3.16 ) is the log of the total number of possible 
native cores at Q. This quantity will appear in the free energy functional below, where after the functional is 
minimized it is interpreted as the number of thermally accessible routes at a given temperature. 



3.B.4- Thermodynamic perturbation theory 



Another argument for the lowering of the barrier makes use of thermodynamic perturbation theory |82] . 
Consider a Go model with M contacts, whose configurational states are perturbed in energy by a random 
contribution V c = SE C so that the new energy of state c is E c — E° + V c - Let the native energy be unchanged: 
SE = in the native state. Then the change in free energy to second order in V is 



SAF(Q) = (V)-±((V-(V)) 2 ) 



where 



(V) 



^J2V c eM-E° c /T) = {SEY 



(3.17) 



(3.18) 



is calculated by summing over all configurations c having Q native contacts. Now since the change in a 
configuration's energy is the sum over perturbations of native contacts made in that state, 



{5E)' =J25E C 



-E a /T 



c£( 



A/ A/ 



(3.19) 



The last equality follows from eq. (2.1). Thus the first order change in free ene rgy is simply the sum of the 
perturbations times the fraction of time those perturbations are felt, as in eq. ( p . 3[) . However here the first 
order term is the sum of a large number of random uncorrelated terms, and so is Gaussianly distributed over 
realizations of the perturbation. The mean of this distribution is zero since the perturbation is randomly made 
contact to contact: 



M 



6AFt = J2 QiSei = MQ 6e = 0, 

i 

i.e. Se = (1/M) XV $ € i = 0, because the native energy is unchanged |p3| . The standard deviation 



(£AFt) 2 = y/MQ(l-Q)b 



(3.20) 



(3.21) 



scales like \N since M = zN . Therefore the first order term in (3.17) will be ± const, x N 1 / 2 . Here we've 
let the individual contact variance Sef — b 2 . Similar arguments of the effects of heterogeneity on the barrier 
were considered in [03. 



On the other hand, the second order term in (3.17) is proportional to {SV 2 ) and scales like N and is 
always negative. By the reasoning in eq. Q3.19 ) the average, over realizations of native disorder, of the thermal 
fluctuation is 



M 



(V 2 ) (V) 2 = J2 5e t Se 3 [(6,6,) (Si) (S d )] 



i,j=l 



Since the perturbations are independent of each other the cross terms in the sum vanish: SeiSej 
b 2 Sij, and 

M 



(y 2 )-{vf = Y,^Qi(i-Qi), 



(3.22) 



6e 2 S,j = 



(3.23) 



i=l 



where t he las t equality follows from the fact that the fluctuations of partic les ob eying Fermi-Dirac statistics 
(c.f. eq. 4.35) obey the property (Sf) — (Si) — (Si) (1 — (Si)). The sum in (3.23) has the form of M positive 
terms and thus scales extensively (~ M) with the size of the system as opposed to the first order term. Thus 
the free energy change due to random perturbations in the native energies is negative in the thermodynamic 
limit. Since native contacts are less formed in the unfolded state than in the transition state, the change in 
barrier height is also negative in the thermodynamic limit. 

That higher order terms do not reverse the trend in barrier height can be ensured by the Peierls-Bogoliubov 
inequality F < F + (V) where F is the free energy in absence of the random component and V is the 
ran dom p art of the Hamiltonian averaged over the unperturbed states, which is just the first order term in 
eq. ( 3.17 ). Thus the transition state free energy (per volume) F(Q^)/N is always less than the unperturbed 
free energy F (Q^)/N in the thermodynamic limit, and since F(0) = F o (0) in the unfolded state, the barrier 
is always lowered. 



3.C. Distributing native energies uniformly in a well-designed protein maximizes the thermal entropy 

If we consider for illustration that native energies are chosen from a distribution which is the independent 
product of Gaussians 
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P({ei})«nexp(-^^) , (3.24) 

then the native energies will typically fluctuate around their average e on a scale 6, with the set of energies 
€i = e being the most likely distribution. We are ignoring for now any constraints on the native energies that 
might be present in real proteins, e.g. for functional requirements. We now show that this distribution of 
native energies maximizes the thermal entropy at any degrees of nativeness, for well-designed proteins. 
The entropy at Q is given by 

S(Q) = -^p a 1np a (3.25) 

a 

where the sum is over all states at Q, and p a is the probability that state a is occupied from the ensemble of 
states at Q. We approximate a well-designed protein by a Go model. Then if all the native energies arc equal 
(ei = e), all states at Q have the same energy E a = QE N . Then the partition function Z is fi(Q) exp(—QE^/T), 
where Q(Q) is the total number of configurational states at Q. Then the probability any state is occupied is 
1/Q<(Q), and the thermal entropy becomes 

S(Q) = lnft(Q) (3.26) 

which is the configurational entropy- the largest the thermal en tropy can p ossibly be. This result is recovered 



again using the general free energy functional theory in section 5 5.C 5.C.2, and can be seen in fig. hOl 

It should be noted that even though the entropy at the bottleneck is maximized for this choice of energies, 
the barrier is not minimized, and in fact may be lowered further by the addition of energetic heterogeneity as 
described above. 



4. FREE ENERGY FUNCTIONAL 

In this section we derive the free energy functional to be used in the main analysis. This should solidify 
the concepts outlined in the previous section regarding the effects of heteroge neity on the folding m echani sm. 



We first show how the functional is related to the Hamiltonian, as in section 2 2. A. Then in section 44.B the 



entropic terms present in the functional are derived. In 44.C the thermal contact probabilities are obtained 
by minimizing the free energy functional. 

4. A. Obtaining the functional from a Hamiltonian 

We can motivate the form of the free energy functional from landscape arguments, i.e. by considering energy 
distributions of states with structural similarity to the native. Consider a contact Hamiltonian Ti of the form 

H ({A aP } | {A^}) = J2 Zfi^cfiKp + ^A Q/3 (1 - A*,) (4.1) 

a</3 

which gives the energy of a particular configuration defined by the set of contact interactions {A Q ^}. This 
Hamiltonian gives energy e^« to the contacts which are native contacts, and energy e a p to non-native contacts. 
We embody the principle of minimum frustration fl24f by making the mean of the distributions from which 
native contact energies are chosen be lower than the mean of the distribution for non- native contact energies. 
Native contacts may also have a smaller variance, depending on the effective number of effective letters in 



the sequence. The energies in (4.1) are internal free energies of spatially short-ranged interaction between 
effective monomeric units, after averaging over side chain and solvent degrees of freedom. The double sum is 
over residue indices, and A Q/ g — 1 if residues a and (3 are in contact in a configuration, A Q/ 3 = otherwise. 
A^ g — 1 if these residues are also in contact in the native configuration, and A^ g = otherwise, e^g and e a /3 
are again the energies of native and non-native contacts respectively. 

We obtain the thermodynamics for this system by considering statistical properties of an ensemble of 
partially native states. Once the density of states n(E\ {Qi}) is known the thermodynamics at temperature 
T can be obtained. We obtain a statistical average of n (E\ {Qi}) from a knowledge of the overall number of 
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partially native states, and the probability each of these states has a given energy. A similar derivation for a 
homogeneous order parameter Q was calculated in J6l| . The probability a configuration with a particular set 
of native contacts {A Q ^A^o} has energy E, given the native state has energy E n , is given by 

P (E\E N , { A Q/3 A^}) =(S[E- H{A a ,}} S [E N - H{A^}} ) nm _ nat (4.2) 

where the averaging is over the non-native contact coupling energies, 

(' ' ^non-nat = T[ P ^<*P) de ^ • 



non — nat 



Residual features in the folding mechanism may be present due to non-self-averaging effects of non-native 
interactions, resulting in phenomena such as "off-pathway" intermediates. We smooth over such phenomena 
with the above averaging, leaving only an average non-native background field, while native interactions are 
explicitly retained. Thus "on-pathway" intermediates, or fluctuations in the free energy landscape as in fig. || 
due to native structural or energetic heterogeneity are retained in this procedure. Note H{A^g} is just the 
sum of the native interaction energies. Averaging the Fourier-transformed delta functions over non- native 
interactions chosen from a Gaussian distribution, 

results in 

P{m „ {iw*» - (arMt ,,;_ 0)) .,, -> (-i^f ) <«> 

where the sum over native contacts Y] a a e^oA^A^ is written in the shorthand single index notation £\ tiQi, 
i.e. Qi = A„(jA™g. Here Qi = 0, 1 but in the free energy functional, fractional values are allowed as in the 




derivation of eq. (2.4) (the entropy per spin would be zero if only integer values of the spin degree of freedom 
were allowed). We will see that the thermal values of the contact probabilities Q* = ( A„^A"o ) T are the 



fractional values that minimize the functional (c.f. eq. (4.35) in sect. 44. C). 

When the log density of states logn (E\ {Qi}) is large, it can be replaced by the disorder-averaged number 
Q({Qi})P(E\E N , {Qi}), since the relative fluctuations in the number die off as M -1 / 2 for uncorrelated disorder: 

k,n(^{ft}) H 5({Q0)- ( ^ . (4.4) 

The term S ({Qi}) is the configurational entropy, discussed below. The thermal energy E(T\ {Qi}) is obtained 
from the density of states above through d\ogn(E)/dE = T _1 : 

E (T\ {Q t }) = ]T uQi - ^f (1 - Q) ■ (4.5) 

This procedure is applicable in the high temperature regime when the number of states occupied at such 
temperatures is large. The energy consists of an integration over an energy density i.e. by an energy per 
contact times the probability that contact is made, e^Qi, summed over all contacts, minus a term cor respon ding 



to a lowering of the thermal energy due to the net effect of non-native traps. Substituting ( |4.5| ) into (4.4) 
gives the thermal entropy 

S (T\ {QO) = S ({Qi}) - ¥* (i _ Q) , (4.6) 

which consists of the entropy of the polymer chain subject to the geometric constraints {Qi} of contact 
formation, S ({Qi}), and a lowering term due to the presence of non-native traps (fluctuations in Boltzmann 
weights due to the fluctuations in state energies reduces the effective total number of states occupied). The 
temperature dependence of S({Qj}) appears through the implicit temperature dependence of the contact 



probabilities Qi (see eq. (4.35)). 
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At this point, since no exact solution for the entropy of a three-dimensional polymer containing topological 
constraints is known, we must either resort to an exact solution of an approximate, idealized model system, or 
an approximate phenomenological treatment of the real, exact system. We choose the latter approach for the 
theory, and the former approach in the lattice simulations. While still an approximation, the entropy we derive 
captures the same quantitative effects we see in the simulations, which of course contains an exact computation 
of the entropy for the idealized lattice model. When computing the entropy in the contact representation, we 
must calculate how much entropy the unconstrained polymer has, Ns , how much polymer entropy is lost to 
form a set of contacts consistent with an overall fraction Q of native structure, S BOND ({Qi}\ {ti}), and how 
much "mixing" entropy is contained in the diversity of contact patterns consistent with that overall fraction 
Q of native structure, S aoVTE ({Qi}): 

S ({Qi}) = Ns + S ROUTB ({Qi}) + 5 BOND ({Qi} | {li}) ■ (4.7) 



These contributions are discussed in detail in section 4 4.B 



The free energ y fu nc tion al at t emp erature T and nativeness Q is written as E — TS in terms of the field 



{Qi}, using eq.s (4.5), (4.6) and (4.7) 



F (T\ {Q t (Q)} | {ej , {£ t }) = J2 *Qi ~ TS ROUTE ({Q t }) - TS BONO ({Qi} \ {£ t }) 



— Mb 2 

+E(Q, rj) - NT So - — (1 - Q) . (4.8) 



The terms depending on {Qi} in eq. (4.8) involve integrations over the native density field {Qi}, while the 
remaining terms depend only on the uniform "background field" Q. We have included a mean energy E(Q, ri) 
dependent on Q and total packing fraction r\, the total configurational entropy Ns = Nlnv, where v is the 
number of configurations per residue in the unconstrained polymer, and the correction to the free energy due 
to non-native ruggedness —AE 2 (Q)/2T. These uniform terms are not central to our main analysis, which 
considers specifically the effects of native heterogeneity in structure and contact energy. 

We note in passing that for the ensemble of sequences with only overall stability J^i e i = -^n specified, 



rather than the whole distribution {ei} as in eq. (18) above, the non-native ruggedness decreases with Q as 
~ (1— Q 2 ) pM, rather than as (1 — Q) as above. This results from averaging over the native coupling energies 
under the constraint J2i e, = E N . 

The native stability gap is composed of a sum of 2-body interaction energies between M pairs of native 
residues. Cooperative contributions to the energy function |6lL p5[ necessary for de novo prediction [p5|,|8q] and 
accurately representing barriers |6l|p2| are not studied here, since native stability is present a priori in the 
free energy of our model, and thus we focus specifically on the properties of already well designed sequences 
to a given structure, for which cooperative effects should induce quantitative but not qualitative changes in 
the results presented here. 



For a Go-like set of interactions, collapse and folding occur simultaneously. E in eq. (4.8) can then be set 
to zero since all energetic contributions ^. Qiti are from native contacts. At the other extreme, if there is no 
change in density with folding and the total number of contacts of any kind is a constant, the term J^ Qi£% 
can be interpreted as the extra energy native contacts get. Then E is a constant, which again has no effect 
on the free energy. In an intermediate regime there is some non-native density coupling to progress along 
the folding reaction coordinate |31j. Since this subtle effect is secondary to and unnecessary for the analysis 
below, we ignore it and treat E as a constant. 

4.B. Entropic Terms 

If we imagine the ensemble of configurations that have a given amount of order, say a given number MQ of 
native contacts, then within this ensemble there are a multiplicity of sub-ensembles of states having different 
sets of MQ contacts, which we identify as a measure of the number of distinct routes in folding to the native 
state. Each sub-ensemble contains many states corresponding to the entropy of the disordered polymer around 
the particular native core (e.g. see fig. ||). We define the entropy that corresponds to the degeneracy of contact 
patterns having functional order {Qi(Q)} as <S ROUTE ({Qi(Q)}) {S ROUTE > 0), and the configurational entropy 
lost from the coil state to induce the ordering specified by {Qi} as 5 BO nd ({Qi} I {^i}) (Sbono < 0). 
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4-B.l. Route Entropy 
In capillarity models of nucleation [p7| , <S RO ute corresponds to the log of the translational partition func- 



tion 



P0| which scales logarithmically with system size, plus the entropy of surface fluctuations of droplets 
of a given size |9lL B2| which correspond to logarithmic terms in the expansion of the free energy density. This 
entropy is small compared to the total conformational entropy, however at the spinodal where F(Q) becomes 
downhill (e.g. long-dashed curves in fig. O ), the nucleus is of small amplitude and highly ramified M,M. 
In this regime the droplet structure is percolative as in spinodal decomposition of binary fluids, and the cap- 
illarity approximation is poor. Field-theoretic descriptions for the structure of the droplet are typically used 
in this regime [B4lB5fl. Binary fluid approximations to the route entropy in proteins which scale extensively 
with system size have been used in this limit [54lE3,|43,pl|,|7l|.^G,p^|. The amount of route diversity in fold- 
ing has also been analyzed in terms of the Shannon entropy 198 , which is similar in spirit to the following 
treatment p3|. We make no capillarity or spinodal assumptions, and treat the route entropy 5 ROU te ({Qi}) 
as a fairly simple modification of the entropy of a binary fluid mixture [p2J : 



exp6g OPT .(Q)= AfQ , (Jt fl MQ) , ^(n ?) " (4.9) 

n° = Q-0 (1 - Q)- {1 - Q) (4.10) 

which we interpret here as the product of the complexities per contact Q° and is readily generalized to the case 
where the complexities are not all equal: exp<S° OUTE {{Qi}) => IIi=i Q7 0- ~ Qi)~ , as in eq. (2.4) 



The principle modification introduced here for proteins is that, due to chain connectivity, as contact density 
increases, there is less sterically allowed space for a monomer to move around when one of its constraining 
contacts is broken. Thus not all M\/MQ\ (M — MQ)\ contact patterns have an entropy « Ns + 5 B ond- In 
other words making some native contacts forces spatially nearby contacts to be made because the corresponding 
monomers are forced to be in each other's proximity. So there is a reduction from the putative complexity 
(f2°) since not all M contacts are independently contributing to mixing, with several contact patterns 
corresponding to the same constrained state. Here we remove this degeneracy by dividing out the (O?) a ^'J' 
states that have been overcounted. Making a mean-field approximation for the local field around contact i 
which reduces its complexity, J2 a ^a Oa^/E Q i/3 (-*•) — Qi the new total complexity is ]\ i=1 ^1 ~ a{ J >>. Here 
a(Q) is a monotonically increasing function of Q, from a(Q — > 0) = to a(Q — ► 1) = 1, since a nearly fully 
constrained polymer has all its entropy on the surface, making the mixing entropy per monomer negligible in 
the thermodynamic limit. We introduce the form a(Q) = Q a with a a parameter deter min ed by the best fit to 
the lattice data (see fig.s and [13| and table | ). The route entropy appearing in eq. (18) then becomes [J43fl : 



M M 

5 ROUTE ({Qi}) = log [] V- {Q) = A (Q) ]T [-Qi In Qi - (1 - Qi) In (1 - Q t )) (4.11) 

A(Q) = 1-Q Q (4.12) 

The factor X(Q) measures the entropy reduction due to the coupling of chain connectivity with the native 
topology under study. The power a in A (Q) should be a decreasing function of the persistence length, and 
also of system size N, since for larger systems more polymer is buried and thus more strongly constrained by 



surrounding contacts. Fluctuations in contact probabilities Qi will lower the route entropy (see eq. (4.33) and 
also figure |l3| ). An alternative derivation for the route entropy in a protein is given in Appendix B. 

4-B.2. Bond Entropy 

The calculation of the total entropy lost due to contact formation is rendered difficult because the entropy 
loss of a given contact depends not only on the contact's sequence-length or bare loop-length £ i: but also on 
the configuration of contacts {Qi} already present when the contact is formed. In spite of this difficulty some 
general statements can still be made, as follows. 

If we make the assumption that the entropy loss to form contact i depends explicitly only on the sequence 
length of contact i, as well as the full contact pattern present {Qi}, then the most general form for the change 
in entropy due to contact formation, to go from configurations having one set of Q^'s parameterized in terms 
of a variable t, {Qi (t Q )}, to another state having {Qi (tf)}, is 

14 



s n 



({Qi(t f )}\{Qi(to)}) = J2 [' ' VQdt) 



Si(ti,{QAt)}) 



(4.13) 



Here Si (£i,{Qj (£)}) is the entropy loss to form contact i having sequence separation £i, in the presence of 
the contact pattern {Qj (£)}, which is itself parameterized through t [99]. Each Si (£i, {Qj (£)}) in eq. (4.13) 
is functionally integrated along the M-dimensional path specified byjQi (£)}. However the entropy as a 
function of the set {Qi} must be a state function, meaning that the value of the integral depends only on the 
end points and not on the path taken. The condition for path independence is obtained as follows. We can 
envision a small subsection of the M-dimensional path as traversing a hypercube of volume n»=i ^Qi- Then 
path independence means the entropy increment <S B ond ({Qi} I {Qi + $Qi}) is independent of the order the 
edges are traversed in going from {Qi} to {Qi + SQj}. Consider two possible paths labeled (1) and (2) along 
two of these coordinates {Qi, Qj}, as shown in fig. g. Along path (1), the entropy change to second order in 
5Q is 



5, 



(i) 



Qi+SQi 



YiS i (£i,Q' i ,Q j ) + 



i ' / &Q'j s j (^oiQi + SQu Qj) 

Qj 



SQ 2 ds SQ 2 ds- 

= s i \ii, Lji, (4j) Otji + Sj ("jl^i, Qj) OLjj H - \li, Lji, (4j) H — „ (tj, LJi, LJj) 

+ 8Qi5Qj^-(£j,Qi,Qj) 



(4.14) 



while along path (2) the entropy change is the same as expression (4.14) except that the last term is replaced 
by 5Qi SQj ^- (£i, Qi, Qj). For these two expressions to be equal 



||(MQ*}) = J^;(M^}) ^ i^j 



(4.15) 



For M dimensions, it follows that eq. ( 4.15 ) holds for all pairs (i, j), yielding M(M— 1)/2 nontrivial constraints 
on the form of the configurational ent ropy loss at each value of Q. 

When the entropy loss satisfies eq. ( 4.15 ), the total entropy difference only depends on the initial and final 
states and can be rewritten as 



Sbond ({Q{}\{Q°}) =E /„' d & 8 i(*i,{Qj}) ■ 



(4.16) 



Now we seek an approximate formula for Si that satisfies eq. (4.15). In forming a contact i from the 
unconstrained molten globule or coil state, the segment of polymer loses the entropy of a free chain with the 
length of that segment, s, (£i, {Qj} = {0}) = I n (a/l j) where a is a Q independent constant related through 
a sum rule to polymeric properties (see eq. (4.25)). However "zippering up" contacts formed in a nearly 
fully constrained polymer cost almost no entropy: Si (£i, {Qj} « {1}) = 0. To account for this we introduce 
an effective loop length £ E FF(£i,{Qj}) into Si(£i,{Qj}) = ln(a/^ EFF ) 3 / 2 . We ignore here possibly important 
changes in the power of the ideal chain exponent 3/2, s ince i t becomes cumbersome to incorporate an exponent 
dependent on {Qi} and to simultaneously satisfy eq. ( 4.15 ). 

Because of the path independence of the configurational entropy loss S BONr> ({Q i }\{Q°}), the change in 
entropy for a small change in one of the contacts Q\ — > Q\ + SQi is simply the integrand evaluated at the 
upper limit: 



dS n 



dQ. 



-({<#}!{<#})=* (4 



{Qi} 



(4.17) 



which can be shown from eqs 
In this paper we satisfy eq. (I 



Tq) and (4.16) by using the definition of the derivative. 
15|) with the following ansatz for the functional form of £ E 



W (U, {Qk}) = / (£i) g ({Q k }) = f iU) 9 




(4.18) 
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so that the loop length is decreased by a function of the mean of the contact density field, g(Q). This is in 
the spirit of the Hartree ansatz in the one-electron theory of metals, where electrons interact only through 
an averaged field. The condition £ EFF (£ i7 Q = 0) = £j gives / (£i) — ii and g(0) — 1. The condition that 
4ff (£%■> Q=l)«l gives <7(1) w 1/Z (since <?(<3) cannot depend on ii), where £ = (1/M) J^i ii- To approximate 
the Q dependence of i EFF we note that the probability of a monomer being constrained at Q is roughly Q under 
the assumption of a uniform contact probability. Then given a chain of unbonded monomers, the probability 
of it being length L is then pl — Q(l — Q) i_1 - So the average length of strings of unbonded monomers at Q 
is then L = ^2, L Lp^j ^2, l Pl — 1/ Q, w hich can be interpreted roughly as the total length of polymer N over 
the total number of bonds ~ NQ |10(| , or the total length over the total number of constrained monomers. 
We approximate the effective loop length at Q, i EFF (ii, Q), in the same way by dividing the total loop length 
ii by the number of bonded residues in the loop (or the approximate number of bonds in the loop) = IQ, so 
that finally 



Si(£i,{Q k }) «2 ln V' ( 



(ti,Q) 



(£-l)Q + l 



(4.19) 
(4.20) 



Note i EFF has the mean-field behavior for large ii and also has the right limiting behavior as Q — > and 
Q — > 1. Eq.s (4.20) and (4.19) are accurat e for weak dispersion in loop lengths; for larger values of Sii they 
must be modifi ed (s ee co mmen ts after eq. ( 4.29| ) ). 

Expres sion s ( 4.16 ) and J4.19| ) reduce to the Flory for m for the configurational entropy loss in the mean field 
limit 



100| , |l0ll when ii = £ and Qi = Q. Then eq. flL16p becomes 

a[l + (I-l)Q] 



3/2 



d(<9|0)= / dQM In 



(4.21) 



which can be interpreted as a summation of entropy losses from to Q: 



sZZ(Q\o) = E A $(Q')= E ln 

Q'=AQ Q'=AQ 

'fl(AQ) fl(2AQ) 



= ln 



fi(Q') 



fi(Q'-AQ) 

n(Q) 



n(0) fi(AQ) fi(Q-AQ) 

5 <MF) (Q) - 5 <MF) (0) . 



= 111 



n(Q) 
n(o) 



When £Q > 1, eq. (4.21) gives 



p(MF) 

^ (Q|0) 



3Q 



(lna-1 + lnQ) 



which is essentially the Flory result derived ea rlier in the mea n-field limit. 
In the presence of heterogeneity, equations (4.16) and (4.19) give 



(4.22) 



M 



M r Q 



5 BO nd ({Q t } |0) =-MQlna-E<9i ln ^ + E / d & ln 



M (g(lna)-^E^ ln ^-Q 



l-\ 



M 



rE^ 



£-1 



ln [1 + (£ - 1) Q] 



(4.23) 



where the last integral can be done by charging up each Qi one at a time (in any order) to its value at Q, i.e. 

J2j<i Qj + Q'i ) ■ This gives an expression identical to the mean-field 



1 



e-i 

M 



the integral is J2i Jo * ^Q'i ^ n 

result for this term, since the integrand only depends on Q and is integrated u p to each Qi(Q). 

Because the free energy of the native state F ({1} | {e^} , {ii}) is E N (c.f. eq. (h8)), all the polymer entropy 
is lost upon folding in the model. Therefore there is a sum rule for the entropy loss, 
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M „i 



5 BO nd ({1} |0) = V f dQi s t (£i, {Qj}) = -Nhxp = -Ns 



which, using eq. (4.23|), determines the coefficient a in the entropy of bond formation: 



\na{ V ,{l l }) = -^ + l + \nt-^— i 



Int. 



(4.24) 



(4.25) 



The coefficient a depends on the distribution of ti as well as the entropy per monomer s . Using (4.25) 
and ( |4.25| ), the final expression for the entropy loss arising from contact formation is 



Sbond {{Qi} |0) = S MF (QJ) - -M (SQ6\n£) 



where the first term in (4.26) is the mean-field entropy loss 



S MF (Q,l) = -QNs 



3„ r ^ ln £ 

-MQ= 

2 *l-l 



*M J- [1 

2 e-i L 



(£ - 1) Q] In [1 + (I - 1) Q] 



M (5QSln£) =^2(Qi~ Q) (in 4 - ln£) 



(4.26) 



(4.27) 



and the second term in ( 4.26 ) is the change in entropy loss due to fluctuations (again the notation (Xi) 
x = JlJ2i X i is used): 



(4.28) 



From ins pectio n of eq.s (4.26-4.28) we can c onfirm that S BOKU (Q = 0) = and S BONU (Q = 1) 
1Q > 1, ( p~27l) reduces to eq.s (|4.22|), ( gtgg ): 



-7V So . When 



S MF (Q,?> 1) « -Q7Vs + -zNQ]nQ 



(4.29) 



which has lost the information about the mean loop length and only retained information about the total 
chain length N, as in the Flory mean-field theory. The first term in (4.27) or (4^29Hs the loss in entropy to 
constrain a given fraction of the protein and is linear in Q. The remainder in(i.27) or (4.29) is the extra 
entropy loss this constraint induces on the remaining free parts by pinning down regions of the polymer chain. 
The analo gous qu antity in the capillarity theory is the surface entropy cost in forming a nucleus of folded 



structure ]84,102]. In capilla rity t heories, the surface entropy cost scales like N 2 ' 3 , whereas in mean- field 



theories it scales like N. Eq. (4.27) can be th ought of as a generalization of eq. ( 4.29 ) to finite mean return 
length J for a finite-sized system, and eq. ( 4.26 ) can be thought of as generalizing ( 4.27|) to include fluctuations 
in the return length. 



The effect of fluctuations in ( 4.26 ) is typically to increase the bond entropy of partially native states. 
The trend in the folding barrier with heterogeneity results from the interplay of this effect with the effects 
of fluctuations on the route entropy and native energetic fluctuations. The magnitude of the effect scales 
extensively with the size of the system. To illustrat e, re call that for a Go model the total entropy at Q is 
Ns + S noVTE ({Qi}) + S B ond({Qi(<3)|0}) (c.f. eq.s ( |4.6|) and (4J)). Thus if wc look at loops longer than 
the average (£i > I, and since the log function is concave do wn, l nlj > hxl), then they are less likely to be 
formed (c.f. eq. 4.35), so that Qi < Q and the second term in ( |4.26| ) is negative, thus raising the bond entropy. 
If £i < J, Qi > Q and the effect is the same. The halo entropy of the system Ns + S BOND ({Qi(Q)\0}) , or 
Flory entropy as we refer to it later, increases when we relax the condition that all contacts must be equally 
probable, and allow differences in contact probability based on their entropic likelihood (see fig. |l5| ). 

From ( 4.29 ) it can be seen that there is an entropy crisis (<S MF < 0) at values of Q < 1 when 2s D /3z £ 1. 
This is essentially because in the mean field approximation contacts are shared equally between residues; only 
one contact is needed to constrain a residue, however there may be more than one contact per residue. The 
increase in entropy from heterogeneity alleviates (but not necessarily eliminates) this problem. The route 
entropy described above in section (4 4.B 4.B.1) further increases the total entropy. However if there is a crisis, 



then at Q values higher than that where the entropy crisis occurs, the mean field description is no longer valid. 
Typical values of the parameters from off- lattice simulations of Chymotrypsin inhibitor or the a-spectrin SH3 
domain |3£|] give s = 3.4, z = 2.4, 2s D /3z = 0.94; here the entropy crisis occurs rather late in folding, if at all, 
because of entropy increase by the above-mentioned effects. At Q values above the crisis, fluctuations from 
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the mean-field contacts per residue must be accounted for. One way to achieve this is to switch to a residue 
representation for the entropy: the number of states is counted by considering the combinatorics of strings of 
residues which are frozen or melted out Bit]. 



On the other hand, eq. (4.19) breaks down for sufficiently large structural heterogeneity. Inspection of (4.19) 
shows that the entropy loss has the same derivative as a function of Q for all contacts, but the initial values 
are different. This leads to some problems with the shorter loops for high Q values, which it is worth noting 
as a word of caution here. The crude way in which the entropy loss for a loop is coupled to the degree of 
nativeness of the rest of the protein leads to a non-negative entropy loss to close some of the shorter loops near 
Q rs 1. We resolved this problem by actually truncating the entropy loss formula for the shorter loops when 
they reached a value of zero. Putting eq. (4.25) into (4.19), letting £Q ^> 1, and expanding to first order in 



5£/£ (weak dispersion limit) we obtain the approximate value of Q where the entropy loss crosses zero, namely 
Q v « 2s /3z + 5£i/I. When 5£i = this is consistent with the Flory analysis above, however when 5£{ < 
(shorter loops) Q v is decreased. We truncate the entropy formula at zero for Q > Q v . 

As a simple illustration, consider a structure whose loop distribution is given by first returns of a self- 
avoiding walk, P(£) « {3/2)£~ 5 / 2 , and thus from eq. ( p~19| j P{s) = exp(s - s{£ % = 1)), where s(£i = 1) = 
5/2 — (9/4) In 3 — s /z + (3/2) ln(l + 2Q) is the entropy of closure for the smallest loops (where the problem 
is the worst). Using the above values for s and z, all Si are negative until Q v ^ 0.76. Since the barrier peak 
typically occurs at Q values smaller than this, errors due to truncation would be small for these structures. 
On the other hand protein structures tend to have distributions with a wider dispersion than the random 
globule, and in these cases the problem would be worse. Applying the theory to the lattice structure of fig. [y, 
we must truncate the entropy loss for loops with £i — 3 at Q v s « 0.4 and for loops with £i = 5 at Q v § « 0.75; 
for all other loops there is no entropy crisis. Numerically there is some quantitative error introduced by this 
truncation, since in the theory these loops no longer contribute to the total entropy loss above Q v , whereas 
in the actual simulation they do. Of cour se, implementing a cutoff in loop entropy causes the total entropy to 
deviate from a state function by eq. ( 4.15| ). Theories of polymer entropy which take more complete account of 
correlations should remedy this and a re a topic of future work. For now we content ourselves with the Hartree 
style entropy formulation in eq. ( 4.19J ), implementing a cutoff if needed. In general however truncating doesn't 
qualitatively chan ge tren ds in the b arrier except possibly in pathological cases of limited relevance. 

(4.11) and (4.26) together give an analytic expression for the free energy for a fast-folding 



Equations 



protein which includes heterogeneity in the folding mechanism: 

F ({Q t (Q)} | {e,} , {£i}) = F MF (Q,e,£) + SF ({<5QJ|{<5eJ, {<%}) 



(4.30) 



where we've written the total free energy in ter ms of a mean-field term plus a fluctuation due to variations in 
energy, loop length, and contact probability. In (t4.30|) , F MF /M is the mean-field free energy per monomer |6l[| : 



F 



MF 

w 



eQ-T^-T^(Q,7)-T^ 

z M 



i2HIE(Q)_^l(l_Q) + JE. 

M W IT l ^' M 



(4.31) 



with iS MF given by eq. (4.27), and 5 RO ute given by eq. (4.11) with all Q. L — Q. The fluctuation in ( 4. 30] ) is 
given by 

^ ({6Qi}\{6ei}, {S£i}) = (5Q Se) + TX(Q) (q z 1b ^ + (1 - Q<) In i^\ + It (SQ S ln£) (4.32) 



Equ ation (4.31) contains 5 adjustable parameters which chara cteriz e the system: N,s ,z,b and E, and 
eq. (4.32) contains 1 adjustable parameter: a in X(Q) of eq. (4.12). Once chosen, the_se parameters are 



fixed for the rest of the analysis. We've chosen some values for the parameters in table | to compare with 
the lattice simulations. All other quantities such as e, £, S£ 2 , etc. arise from the structural and energetic 
distribution of a given protein at overall nativeness Q and temperature T. In our analysis we study trends in 
the thermodynamics by varying these distributions. 

As noted above, the free energy functional consists of an integration over a free energy density whose 
only information about the surrounding medium is through the average field present (Q): F = Y^JiiQi] Q) : 
Explicitly accounting for cooperative entropic effects due to correlations between contacts [[34], [45, 103, 104] 
would be an important extension of the model, and terms that lead to such effects have been introduced into 
the functional in similar models f34|,|45J|. 

We can make connection with the intuitive arg umen ts discussed previously by investigating the effects of 
heterogeneity on each of the three terms in eq. ( 4.32J ). As mentioned above, for longer loops the contact 
probability is expected to be less than average, and for shorter loops Qi is expected to be above average. So 
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relaxing the Qi values to accommodate this makes the third term in ( 4.32 ) negative, lowering the free energy. 
Also, sin ce th e fluctuation SQi is expected to be positive when a contact is stronger (Sti is ne gative), the first 
term in ( 4.32 ) is negative and the free energy is lowered. Lastly, the second term in eq. ( 4 . 3 2| ) consists of two 
terms inside the average which are both concave up, i.e. have a positive second derivative w.r.t. Qi. Thus the 
average of the terms is greater than the term evaluated at the average, i.e. 



«"!) >ffl)b, f 



= o 



(i- 



Qi)ln i^t) >( 1 -(^)) ln ^i 



i-(Qi 



Q 



= o, 



(4.33) 



and so the second term in (4732) is positive. Fluctuations away from uniform ordering raise the terms in the 
free energy due to route entropy. This effect competes with the two lowering effects above. To find which terms 
dominate, we find the functional dependence of the contact probabilities Qi on the energies Ci and entropies 
Si in the next subsection, and then investigate the trend on barrier height in section pi 



4.C. The most likely distribution of contact probabilities 



Equations ( 4.3C ), ( 4.31 ), and ( 4.32 ) describe the free energy for an arbitrary distribution of contact probabili- 
ties {Qi(Q)}, subject only to the constraint that the average probability (Qi) is Q. The most likely distribution 
{Q* {Q)} 0I the contact probabilities Qi(Q), i.e. the thermal distribution, is obtained by minimizing the free 
energy F ({Q % (Q\ {ej , {^})}) subject to the constraint J2i Qi (Q) = M Q> i- c - $( F + [ l Y,j Qj) = °> or 



E 



^ ({Qi} I {*},{*<}) + m 



SQi = 



(4.34) 



for a rbitr ar y and indep ende nt variati ons S Qi (c.f. the analogous expression cq. ( |2.5| ). Substituting 
eqs. ( 4.30 ), ( 4.31 ), and ( 4.32 ) into eq. ( 4.34 ) yields a Fermi-Dirac distribution for the most probable ther- 
modynamic occupation probabilities {Q*} for a given {e^} and {(-i}'- 



Q*(Q,{e i },{4}) = 



1 + cxp [^ (li+ei- Tsi {£,, Q) - TX> 



&)]' 



(4.35) 



where A' = dX (Q) jdQ (c.f. eq. |I| ) and 



(Q) 



-Qj In Qj ■ - (1 - Qj) In (1 - Q 3 )) . Thus each 



probability Q*, referred to below simply as Qi, is a function of all the {Qj}, and must be solved for self- 
consistently. Non-native ruggedness i ntrod uces a term with anomalous 1/T 2 temperature dependence in the 
distribution. By the structure of eq. (4.35), all contact probabilities Qi are between zero and one. 

The Lagrange multiplier \i is determined by the constraint ^2 li Q* = MQ, and so is a function of Q and the 
distributions of {ei} and {£i}- It can be interpreted as proportional to an effective force along the Q coordinate 
since 



1 dF 
fI ~~MdQ 



(4.36) 



by the properties of the Legendre transformation (see Appendix A). Thus again since the free energy F is of 
course a function of Q and the distributions {ej} and {£i}, \x = —(l/M)dF/dQ is also. 

The second variation of F ({Qi} | {fj} , {d-i}) (neglecting terms of order O (1/M)) is indeed positive 



8 2 F 



XT- 



>0, 



dQjdQi Qi(l-Qi) 

verifying that the extremal values of Qi are the ones which minimize F ({Qi} \ {fi} , {^i})- 



(4.37) 



5. CHANGING FOLDING MECHANISMS BY TAILORING NATIVE INTERACTION ENERGIES 

AND ALTERING NATIVE STRUCTURAL MOTIFS 

Most single domain proteins most fold over a free energy barrier of a few k B T at the transition temperature. 
This barrier is small compared to the total thermal energy in the system, reflecting the exchange of energy 
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for entropy as a protein folds |61 , 105 1 . However the barrier height can vary significantly depending on which 
parts of a protein are most stab le in t he native structure, i.e. how the native energy is distributed throughout 
the native structure. In section 5 5. A we look at the effects on the thermodynamics when native interactions 
are changed in a controlled manner. We find that a distribution of native energy which induces a uniform 
folding mechanism will maximize the barrier. For model systems of small proteins this barrier is about twice 
as large as the barrier when native energies are uniformly distributed. Increasing heterogeneity in the folding 
mechanism systematically decreases the folding barrier and may eliminate it entirely, at least in the absence 
of cooperative inter action s. The corresponding folding rate increases, as long as the protein remains well- 
designed. In section 5 5.B we develop a perturbation expansion of the free energy to incorporate structural as 
well as energetic heterogeneity, and the effect on the free energy of correlations between them. In |5 5.C we 
further apply the functional theory to calculate various thermodynamic quantities and compare the results 
with simulations of the lattice 27-mer, and in 5 5.D we co ntinue the comparison by applying the functional 
theory to properties of the folding mechanism. In section 5 5.E we show how the folding barrier decr eases 
with the degree of route-like folding in the system, so long as the protein remains well-designed. In 5 5.F 



we explicitly investigate the kinetics of folding times in the system and find that folding kinetics is well- 
characterized by the thermodynamic folding barrier. In p 5.G we illustrate the effect of contact order or mean 
contact length on the folding barrier in the model. In 5 5.H| we investigate the effects of structural variance 
on a hypothetical ensemble of well-designed protein fold motifs. We find that for fixed average loop length 
J, native structures that have larg er di spersion S£ in the distribution of return lengths tend to have smaller 
folding barriers. Finally in section 5 5.1 we show how native energies can be tuned or native structures can be 
sought to match a desired free energy potential, which we illustrate for a simple two-state potential as well as 
a potential with an on-pathway intermediate. 



5. A. Energetic heterogeneity for a given structure 



First we consider the free energy as a function of Q and the fiel d of energies {fii}, given the fiel d of l oop 
lengths {£{}■ Each contact probability Qi in the free energy (eq. 4.32J ) is considered through eq. (4.35) to 
be a function of Q, its energy, its loop length, and the Lagrange multiplier fi(Q, {ej} , {lj}), which is itself a 
function of Q and the distributions {ej} and {£j}. Thus the free energy depends both implicitly and explicitly 
on {ei}. 

We now seek to relax the values of {ei}, at fixed stability (fixed total native energy) 



3 



(5.1) 
(5.2) 



to the distribution {e* ({£j})} that extremizes the free energy barrier. Under variations of the energies {5ei} 
for a given structure {£i}, the free energy becomes 



F{t io + 8t i } = F{e io } + Y J 



5F 
fie. 



Sc; 



-y 



2! 



» ,3 



6 2 F 



Set 8e-j + 



(5.3) 



where 6/Sei is the total derivative with resp ect to q. So the distribution {e* ({^j})} that extremizes the free 
energy barrier subject to the constraint eq. (5.1) is the solution of 6(AF$ — p^], ej) = 0, or 



X 



5AF* 



8a = 



(5.4) 



for arbitrary and independent variations Set in the energies. The Lagrange multiplier p imposes the constraint 
that the total native energy E N is constant. Changes in the barrier height are roughly equal to changes in the 
free energy at the barrier peak, since the free energy in the unfolded state Q ~ is more weakly dependent 
on {e^}, i.e. SAF^/Sci = 5F(Q*)/Sei, because 5F(Q « 0)/Sei = 0; less native structure is present in the 
unfolded state. The effect on the free energy of perturbations in {e^} is largest at intermediate Q; there is no 
effect at the end points because at Q — there arc no native interactions, and at Q — 1 all native interactions 
are present and must add up to the total native stability E N , which is fixed. In fact in the equations for the 



20 



free ener gy pe rt urbat ion thi s effe ct is manifested by the factor of Q(l — Q) which multiplies every term, see 
e.g. eq.s (5J-SJ), (^23|), and (Ejgt). 



Because of the implicit functions mentioned above 





<9F y, dF dQj y, dF dQ 3 d/j, 
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dQj dQj dfi 
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(5.5) 
(5.6) 



However the term in square brack ets is just the total derivative SQj/5ei, so the sum vanishes because Q is a 
fixed parameter independent of e, |106| : 



Differentiating eq. 
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immediately yields: 
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(5.7) 



(5.8) 



so the perturbative change in the free ene rgy barrie r by varying a contact's energy is equal to the probability 
that contact was formed at Q* (c.f. eq.s (3.2) and (3.19) ). 

This is strongly related to experimental <f>i values, which measure the change in the log rate af ter m utation 
over the change in difference in equilibrium populations of the folded and unfolded states [p2, 107 ] . When 
the prefactor to the rate is unaffected by the mutation this is equivalent to the change with mutation in the 
barrier height over the change in the difference of the free energy minima pQ,p2j, which we refer to as <j>'\ 



(dFt/dej) - (dFjckj) _ Q r (Qi) - Q t (Q u ) 
(dFf/dei) - (dFu/da) Q % (Q P ) - Q % (Q u ) 



(5.9) 



When the nativeness in the unfolded state can be neglected, Qi{Q 
folded state are essentially fully formed, Qi(Q F ) ~ 1. Then eq. (|5 
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Q l {Q t ) 



rj 0, and when the native contacts in the 
becomes 

(5.10) 



Equating <f> values with contact probabilities assumes that contact probability be used as a kinetic reaction 
coordinate. In fact it has been observed in simulations that <f> values correlate with Qi values as well as any 



other reaction coordinate currently proposed [108 



Continuing now to find the energies e* which extremize the free energy, eq. (5.4) gives finally: Qi(Q*,fj} = 
0, e*, £i) = p: the free energy is extremized when all the Qi values are tuned to the same number at the barrier 
peak. This folding scenario is that of a symmetric funnel: the protein is equally likely to order from any place 
within it. Thus since ^2 t Qi = MQ, 



Solving eq. (5.11) for the energies using eq. ([4.35) gives 



< = T Sl + T^(\ [-Q InQ - (1 - Q) ln(l - Q)])^ - ^ . 



(5-11) 



(5.12) 



Subtracting e from a by averaging eq. (5.12) yields: 

e* - e = T (si - s) 



-T(ln£i-ha) 



(5.13) 



where eq. (4.2C) was used. The free energy fluctuations Sfi — in a uniform folding mechanism. Thus contacts 
pinching off longer loops (£i }t li) have lower (stronger) energies (ej < e7) to make all the contact probabilities 
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equal at the barrier peak [g_09| . If correlations between contacts are fully accounted for, the Qi values deviate 
slightly from Q away from the barrier peak, but the fluctuations away from uniform ordering are still strongly 
suppressed (see fig. |lJC ). 



Evaluating the second derivati ve s tability matrix in eq. (5.3) shows Qi = Q$ in eq. (5.11) to be an unstable 
maximum, as follows. From eq. ( |5.8| ) 
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(5.14) 



by eq. (4.35). Thus the second order change in the free energy at the extremum is 
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(5.15) 



Since the perturbations 5ei are independent, cross terms in the double sum of eq. ( 5.15J ) vanish, making the 
sum equal to 
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(5.16) 



This term is negligible for the following reaso ns. First, note that dF/ dej = 
dix/dti = -(l/M)(d/dQ)(dF/de^by eq. (p6|) , the terms dfi/de, in eq. (PI) 
of M terms in ( 5.16| ) is ~ O (1) Se 2 , whereas the first term in eq. ( 5.15 ) is 
thermodynamic limit. Thus to order 0(1/ M): 



Qi is ~ 0(1). Then since 

are ~ O (1/M). So the sum 

O (M) Se 2 and dominates in the 
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(5.17) 



which is clearly negative, meaning that tuning the energies so that Qi — Q$ maxim izes the free energy at 
the bar rier peak. This is consistent with the intuitive arguments in section 3 3.B. Recall for example in 



section 



3 3.B3.B.4 we showed using thermodynamic perturbation theory that random perturbations to the 



contact energies alw ays lo wered the fr ee en ergy b arri er. 
Substituting eq.s (5.8), ( |5.1l| ), and ( 5.17 ) into (5.3) gives 



AFi{e* + Se t }^AFl F -M 



Q*(1-Q*)t3 



2\tT 



6c 2 



(5.18) 



It should be noted this expression is very similar to eq. ( |3.5D obtained using a simple REM argument, and 
also to eq.s ( |3.17 ) and ( [3. 23 ) using thermodynamic perturbation theory. The only major difference here is the 
factor of A* which accounts for the reduction in route entropy due to chain connectivity. 

For an energetic standard deviation of about a k B T from the optimal distr ibutio n, the barrier goes down by 
about - Nk B T/2 (we've let M « 2N, A* « 1 - Q i since the exponent a in fl4.12| ) is about 1, and Q x w 1/2). 
The barrier governed rate increases with native energetic heterogeneity as 
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exp I Q*(l - 0*)^^ 



(5.19) 



which should be compared with eq. (3.6) 



5.B. Including structural heterogeneity 

The theory also allows us to investigate the effects of native structural variance on the barrier, as well as the 
correlations between structure and energetics. A perturbation analysis shows that structural variance lowers 
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the barrier as discussed in section 33.B3.B.3 , and that entropically likely contacts should be made stronger 
to lower the barrier, as discussed in section 3 3. A. In the model, entropically likely contacts are short-ranged. 
However they may occasionally be long-ranged when entropy is more precisely accounted for by accurately 
accounting for correlations between contacts (see figure Hq) . 

Consider perturbing the free energy of a homogeneous system with £i — £, e^ = e, Qi — Q^, by letting 
£i = £+ Sit and & = e + 5e t . Then, 
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(5.20) 



The first term in the expansion AFjJp {e, £ \ is the mean-field free energy eq. (|4 . 3 1| ) . The second term is zero 
at the extremum where Qi — Q$ by eq.s ( |5.8|) , ( 5.11 ) and ( |5.2| ), and the fourth term is giv en in eq. ( 5. IS ). 
The calculation of the third term proceeds along the same lines as the derivation of eq. ( |5.8| ). Like eq. (5.6), 
SAF/5£i contains a term involving an explicit derivative of £i, and implicit derivatives which are identically 
zero. The explicit term itself vanishes when evaluated for homogeneous fields. From eq. (4.32): 



[SAFi\ 



z rT ,(Q l -Q x 



-T 



£i 



0. 



(5.21) 



Calc ulati on of the fifth term involves calculating SQj/S£i, which proceeds analogously to the derivation of 
eq. J5T7|) via (gig): 
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v 2 XL 



(5.22) 



which is again diagonal and negative as is eq. ( 5.17 ); raising the energ y of a cont act or increasing its loop 
length decre ases that contact's probability of formation. From eq.s (5.21) and (5.22), the fifth and sixth terms 
in eq. ( [5.20] ) can be calculated, yielding 
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(5.23) 



The third term in eq. (5.23 ) indicates that structural dispersion also lowers the barrier, in agreement with 
the arguments given in sect. 33. B 3.B.3 . The fourth term indicates that the free energy barrier is additionally 
lowered in the model when shorter range contacts become stronger energetically (5£{ < and Sei < 0) or 
longer range contacts become weaker energetically (<% > and Set > 0). This means in general that the 
free energy is additionally lowered when fluctuations are correlated so as to furth er inc rease the variance in 
contact participations, in agreement with the intuitive arguments given in section 3 3. A. Note again that all 
reductions in free energy due to structural and/or energetic heterogeneity scale extensively with system size. 



5.C. Illustrations using a lattice model protein and functional theory 



We illustrate in figures B|-ll5l the general effects of introducing heterogeneity to a model system. Results 
are shown for a Go protein, modeled with the free energy functional theory using reasonable parameters and 
loop-length distribution given in table I, and also simulated on a lattice (see also reference Q for a concise 
treatment of some of these phenomena). The lattice protein is a chain of 27 monomers constrained to reside on 
vertices of a three -dimensional cubic lattice (see fig. if). Details of the model and its behavior can be found in 
ref.s 
only) 



11C -112]. Monomers have non-bonded contact interactions with a Go potential (native interactions 
Corner, crankshaft, and end moves are allowed. Free energi es an d contact probabilities are obtained 
by equilibrium Monte Carlo sampling using the histogram method [111]. Sampling error is < 5%. Coupling 
energies were chosen by first running a simulated annealing algorithm to find the set of native energies {e*} 
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that makes all the contact probabilities equal at the barrier peak: Qi({e*}) = Q^ (c.f. table I for the energies 
which induce uniform folding for the native structure in fig. tth. Energies are always constrained to sum to a 
fixed total native energy: J^. e, = Me, i.e. overall stability is fixed. Energies are relaxed from {e*} by letting 

€ t = e* + a{e~4). (5.24) 

For the simulation results, the values a = 0, 1, 1.35, 2.05 were used in the figures, and for the theoretical 
results, the values a = 0,0.5,0.65,1.9 are used. When a — 0, {e^} = {e*} and all the contact probabilities 
are the same at the barrier peak; the folding barrier is maximized. When a = 1, all the energies are equal 
({ei} = {e}); the entropy at the barrier peak is maximized. When a w 1.35 (a ~ 0.65 in the theory) the 
free energy barrier vanishes, and when a ss 2.05 (a « 1.9 in the theory), folding occurs by only a few routes- 
essentially one route through the transition state. The discrepancy between the theory and the simulation 
for the values of a requi red t o produce a given folding mechanism is most likely because the theory for loop 



closure, eq.s ( 4.19 ) and ( 4.2C ) overestimates the dispersion in entropy losses for contact formation. 

In the figures to follow we compare thermodynamic quantities and folding mechanisms for various energy 
functions, consistently using the following convention; thin solid: a S i M = 0, a TH R Y = 0; thick solid: a SIM = 1-0, 
«thry = 0.5; long dashed: a S i M = 1.35, cuthry = 0.65; short dashed: a S i M = 2.05, q T hev = 1-9. 

5.C.I. Total Energy 

Figure || shows a plot of the total energy as a function of Q obtained from lattice simulations of a Go model 
to the structure shown in fig. II. When native energies are uniform, the total energy decreases linearly with 



Q, since the total energy (eq. ( J4.5| ) with b = 0) is J2i Qi e i = e J2i Qi = E n Q. Whe n ene rgies are tuned so that 
contact probabilities at the barrier peak are all equal (Q^e*,^) = Q\ see eq. ( J5.ll ) ), the total energy at 

t.ViP Kflrripr rtpak nnsitirm is pnnal tn ttip total pnpro'v wVipn c : = Z Qinpp N\ O-I/OTW- = C)\ \\ c ■ wtiipln flccflin 



the barrier peak position is equal to the total energy when ej = e, since Yli QdQ_J^i = Q^ ^2% e * which again 



equals E K Q^ . So as the coupling energies are relaxed from {e*} to {e} in eq. ( 5.24 ), the energy at the barrier 
peak hardly changes, as can be seen from the figure (shifting of the peak position is a small effect). Further 
perturbing the coupling energies results in an overall decrease in total energy. Temperatures are adjusted to 
T F for all curves in fig. 0, however T F is roughly constant for the upper 3 curves (see fig.s O] and pi]); for the 
case when the energies are tuned to induce route-like folding: {e^} = {ef V, the folding temperature is about 
a factor of 6 lower. While heterogeneity effects on the total energy in fig. H and entropy in fig. uu below may 
appear fairly small, the effect on the barrier and corresponding rate may be large, since the barrier arises from 
the cancellation of these two large terms, and the rate is proportional to the exponential of this difference. 

5.C.2. Total Entropy 

Figure nffl shows a plot of the total entropy as a function of Q from simulations of the Go model mentioned 
above. When the native energies are all the same, {e{\ — \ t\ (ma ximal solid curve in fig. [To] ), the entropy 



is larger than for any other distribution as shown in section 3 3.C. We derive this result now using the free 
energy functional theory. 

First, the total entropy at Q depends on the relative occupation probabilities and so does not depend 
explicitly on the total energy. One can see this by taking the derivative: 

dS dS 1 v- dS 



^e!!=o <^> 



since the entropy does not depend explicitly on the coupling energies (dS/dei = 0). The last equation follows 
from the properties of the directional derivative (c.f. Appendix A). This independence of canonical entropy 
on total energy is analogous to the property that the Lagrangian for a system of particles not under any 
external forces is independent of the center of mass coordinate. One can solve Newton's equations without the 
constraint and obtain a zero frequency normal mode corresponding to the whole system in uniform motion. 
Keeping the constraint introduces a new equation and unknown (the Lagrange multiplier), and solving the 
eigenvalue problem yields that the multiplier is zero. Another example is a thermal bath of photons, which has 
no constraint on the total number of photons. Thus the Lagrange multiplier for this con straint- the chemical 



potential, is zero. The Lagrange constraint here appears in the form p^2 i £i as in eq. (5^), which has the form 
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of a Legendre trans form with the Lagrange multiplier p being proportional to dS/dE N , just as the Lagrange 
multiplier fi in eq. (4.36) is proportional to dF/d(MQ). So the multiplier p is identically zero. 
We take the extremum of the entropy, now explicitly introducing the constraint that Q is fixed: 



S + u^Qi 



E 



SS SQj 

h v 

oei dei 



da = 



(5.26) 



for arbitrary and independent variations Je^. Since the entropy does not depend explicitly on e, 

5c. 



SS y, OS 5Qj 



dQj 8e l 



Using S={E- F)/T, and eq.s flL5j) and (|4.34|) , 



dS ej [i 
dQ~ = T + T 

Now by the properties of the Legendre transform, 

J_dS_ 



see the analogous eq. (A. 3). Again using S = (E — F)/T, eq. 
eq.(A.5) ), we obtain 



T T 



Putting eq.s ( EJ28 ) and ( |3.30 ) into eq. (5.26) gives 



E 



T 



5e t 



= 



and using eq. (|5.14| ), i.e 



eq. (5.31) becomes 



5Q 1 __, Qj (1 - Qj) 
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(5.27) 

(5.28) 

(5.29) 
|), and the directional derivative on E (c.f. 

(5.30) 
(5.31) 

(5.32) 



(5.33) 



whose only solution is et = e, i.e. all the coupling energies are equal. Taking the second derivative yields 



5 2 S 



= -St 



Qi(i-Qi) 

XT 



(5.34) 



which is negative indicating the extremum is a maximum. Thus the thermal entropy is maximized for uniform 
coupling energies, for proteins well-designed enough to be m odele d by Go-like models. 
Keeping the Lagrange constraint J^ t% = E N modifies eq. (5.33) to 



5e t on + p = 1 < i < M 



(5.35) 



where at = Qi(l — Qi)/XT andp is the undetermined multiplier. Equations (5.35) constitute M + 1 equations 
for M + 1 unknowns (the Sti and p) . The solution then amounts to finding the determinant of the matrix 
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(5.36) 



which is readily evaluated, and consists of M negative terms, which are all possible ways to pick M — 1 of 
the M ai's, e.g. if M = 4, the determinant is — aiQ^a^ — aio^o^ — aia^a 4 — a2a3<24. Thus the determinant 
is never zero, so there is no solution other than th e triv ial solution, where all Set — 0, which is again the 
condition that e.; = e. Putting this condition in eq.s ( 5.35| ), we again recover that that the Lagrange constraint 
p = 0. 



5. C. 3. Free Energy 



Figure [il| shows the total free energy F(Q) in units of e for the simulations and functional theory, for various 
energy functions. For the simulations F(Q) is obtained from E(Q) — T F ({ei})S(Q) where E(Q) is shown in 
fig. (ffl and S(Q) shown in fig. (|10|). The transition temperature T F ({ei}) is defined as the temperature givin g 



a native (Q = 1) occupation probability of 50% (see fig. 



For the theory F(Q) is obtained from eq.s ( 4.30| ) 



(4.31), and ( |4.32| ), and the transition temperature T F is defined as the temperature where the probability for 
Q > 0.9 is 50% (the transition temperature is not strongly dependent on this cutoff). For e* (thin solid curve) 
the folding barrier is maximized, see eq. (|5.17). Here the profile is largely a function only of the mean loop 



length £ and system size N (see eq. ( 4.31[ )), since most structural features are tuned away by adjusting the 
native energies. The only other residual structural fe ature s which remain are the effects of native structure on 
the exponent a in the route entropy reduction, eq. (4.11), which is probably a small effect compared to the 
effects we consider here. For uniform native energies, the transition state energy hardly changes as explained 
in section 5 5.C 5.C.1, but the entropy increases to its maximal value as explained in section 5 5.C 5.C.2. So the 
barrier height initially decreases for entropic reasons (see also fig.s [12] and [17|). For the lattice model the barrier 
is reduced by about a factor of 2, (thick solid in upper plot) and for the theory the barrier disappears entirely 
at the transition temperature, the curve residing between the long-dashed and short-dashed curves in the lower 
plot. The thick solid curve in the theory plot is for about half the energetic dispersion required to tune the 
system to homogeneous folding; this results in the same reduction in folding temperature as the simulations. 
Further perturbing the energies to {e°} eliminates the barrier entirely at the transition temperature making the 
transition second order (long dashed curves). For the simulations, between {e} and {e°} the barrier decreases 
because the energy at the transition state lowers while the entropy doesn't change to first order. In the theory 
there is sufficient entropic variance to kill the barrier before relaxing all energies to a uniform distribution { e} , 
however {e{\ for the long dashed line in the theory has a relative variance Se/e of only about 0.15 compared 
to about 0.47 for the maximum barrier. Coo perative effects p4, pC 62 or entropic reduction arising from the 
coupling of chain stiffness with folding [114] restores a barrier for uniform native energies. As energies are 
further perturbed to a distribution {ef} causing folding to occur by a single dominant route (short dashed), 
folding becomes strongly downhill at the the transition temperature, which drops sharply by about a factor 
of 6. For a system with non- native as well as native interactions, the free energy would be less downhill and 
much more rugged at these temperatures. Folding would be exceedingly slow because the protein would spend 
a long time in individual traps. 

Figure [12] shows the total barrier height AF*, in units of the mean native contact strength e, vs. the RMS 
native energetic variance in the same units. Plotted are results from simulation of a lattice 27-mer Go model; 
the analytic theory gives qualitatively the same results (see fig. |17|). Native interactions {e{\ are tuned to a 
distribution {e*} that symmetrizes the funnel to uniform ordering at the barrier peak, giving the largest barrier 
at the upper right of the figure (see eq. 5.17). The energies here are anticorrelated with their loop l ength s 
in that more negative energies are required for the longer loops to have equal free energies (see eq. |5.13|) . 
A large dispersion in interaction strengths is required to achieve this scenario, wh ich m ay be impossible to 
achieve in practice (see table I). Interactions are then uniformly relaxed via eq. (5.24) to {e^} = {e} (i.e. 
Set = 0) and t he ba rrier is reduced by a factor of 2, due to entropic gains in the transition state ensemble 
(TSE) (see eq. ( 5.34 )). The ti are then continu ed to evolve - now Set begins to correlate with loop length and 
anti-correlate with contact probability (c.f. eq. 5.24 ), which further increases the heterogeneity. The barrier 



decreases now because energetic gains win over entropic losses in the TSE, and eventually the barrier vanishes 
at coupling energies {e°}. We should emphasize here that {e°} is not unique; many sufficiently heterogeneous 
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distributions of native stability may kill the barrier. All the while the folding temperature T F (dashed curve) 
defined here as the temperature where the native state at Q = 1 is 50% occupied, decreases by only 10% and 
remains well above the putative estimate of the glass temperature T G = T F /1.6 (thin horizontal dot-dashed 
line at lower right). The fact that T F /T G w as r oughly constant as native inte ractio ns were varied indicates that 



the prefactor to the folding rate, k in eq. (1.2) is roughly constant (see sect. 55. F ). Folding rates are governed 
largely by the free energy barrier in this regime (see fig. [fq below) and can be treated with thermodynamic 
analysis. Because the entropy of the bottleneck is not constant as 5e/e is varied, another valid measure of 
T F /T G for this system is T F times the root entropy per residue at the bottleneck (thin solid line with squares). 
This measure actually shows a slightly increasing trend as e^ — > e and only slight decrease as {e} — > {e°}. 

5.D. Folding Mechanisms: Route Entropy, Halo Entropy, and Contact Probability 

We continue by investigating different possible folding mechanisms to the same structure, induced by dif- 
ferent distributions of native stability. So far we have seen that in Go- like models of well-designed proteins, 
governed by an energy function with pair interactions, the folding mechanism to a given structure may in- 
volve a barrier of various heights (fig. O) while the total entropy and energy are relatively unaffected (figs. 
and |lC|) , depending on how native interactions are distributed throughout the protein. The barrier is sensitive 
to perturbations because it arises from the cancellation of large terms: the total entropy and energy [pl|, 105] 



We have seen so far that it is quite difficult to tune away all the effects of native topology, or induce folding- 
through one or a few routes. Now we investigate some of the properties of partially native states when folding 
occ urs by various mechanisms. We again vary the distribution of native energies by varying values of a in 



eq. (15.24) 



Figure 13 shows the route entropy <S RO ute over the number of contacts M as a fu nction of Q, for the 27-mer 



lattice model (Top) and for the functional theory (Bottom). In both cases eq. ( 4.11 ) is used to calculate 
the mixing entropy from the contact probabilities {Qi}. For native contact distribution {e*} which induces 
homogeneous folding through the barrier peak (thin solid), the curve essentially reproduces that in fig. [7J. 
This distribution maximizes the route entropy. For a uniform distribution of native energies (Top figure, thick 
solid) there is a reduction from the homogeneously ordering case solely due to the topology of the native 
structure. Different native structures will have different characteristic curves. For the theory curves, the same 
native energy distributions used above in fig. O are used here. For interactions {e°} which kill the barrier, 
the route entropy is further reduced (long dashed). The upper 3 curves in both plots are funneled folding 
mechanisms with barrier heights varying from their maximum to zero. The bottom curves (short dashed) are 
route entropies for a folding mechanism involving a small number of routes. If 5 RO ute = 0, only one route 
to the native state is allowed, because then all contact probabilities Qi are only zero or one at any degree of 
nativeness, and if this is the case each successive bond made in folding must always be the same one at each 
and every degree of nativeness. It is interesting that in both theory and simulation there are fluctuations in 
the route entropy. These correspond to the necessary fractional values of contact formation probability present 
when a particular contact begins to be formed, as long as contact probabilities Qi go from zero to one over a 
finite width AQ (see fig. |il| ). 

Figure H4 shows the contact formation probability Qi(Q) as a function of Q for a ll contacts. The top row 
is the lattice simulation result, the bottom row from the analytical theory, eq. ( 1.35J ). The left column is for 



a route-like foldi ng sc enario, the right column for a uniform folding scenario at the barrier peak. Energies are 



chosen from eq. (5.24) with values for a shown above each figure. There are less curves in the analytic theory 
because contacts with the same loop length (which have the same energy) have the same Qi{Q). In the route- 
like folding scenario, contacts remain unformed until their free energy of contact f orma tion /j = ej — Tsi(Q) 



equals —fi{Q) = (!/M)d F/dQ , for larger Q they are then essentially formed (see eq. (|4.35| ) with s ROUTE = and 



b = 0). Here /x(Q) in eq. (4.35) serves as a chemical potential for each contact. Fluctuations in route entropy 
for a route- like folding scenario can be understood through the behavior of contact formation probabilities. 
For this folding mechanism contact probabilities are approximately either zero or one. But as Q increases, 
new contacts must be formed and so their formation probabilities must go from zero to one, over some finite 
interval AQ for any finite sized system with a reasonable distribution of native energies. In this range AQ, 
the contact (s) increasing from zero to one have fractional occupation probabilities and so contribute to the 
route entropy, causing the bumps seen in figure [b|. In the uniform folding scenario, all Q%{Q^) ~ Q^ at the 
barrier peak position. They deviate away from Q away from the barrier peak in the lattice model because of 
the nontrivial dependence of the loop entropy on nativeness. Deviations away from Q in the theory are due 



to the cut-off introduced into the entropy function mentioned above, so that eq. (5.13) is not strictly valid 
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When contact probability is governed by native topology, as in figure |2|, the variance in Qi values is in between 
the two extreme cases shown here in figure 14 



We can make a connection with earlier polymer theory by investigating the total entropy minus the route 
entropy 5 FLORY = Ns + 5 BON d (see fig. |l^). This can be thought of as the residual entropy of the effective 
halo dressing the partially native structures, and is analogous to the polymer entropy in a cross-linked system 
studied by Flory and others |100 , 115, 116fl . As mentioned in the comments after eq. (4.29), fluctuations away 
from uniform folding raise the bond entropy and thus the Flory entropy, as can be seen in figure [15| for both 
simulations and theory. For the unifor m fold ing scenario, there is a value of Q v < 1 where the Flory entropy 
runs out (see again comments after eq. ( 4.29J ) ) . In a uniform folding scenario above Q v , all residual entropy is 
combinatoric- if MQ V crosslinks were permanently made with equal probability anywhere in the protein, there 
would be no entropy left in the system. The mean-field prediction gives Q v ~ 2s /3z « 0.87. The lattice value 
of Q v w 0.65 is considerably lower probably because ideal chain statistics used in the theory overestimates the 
entropy actually present in a partially constrained lattice polymer of length N = 27. The folding mech anism 
having few routes has essentially the largest Flory entropy here because following the recipe of eq. ( 5.24 ) 
weights a native core with naturally large halo entropy. Weighting other cores would in general not give the 
same result. 

We continue the comparison between theory and simulation by plotting several qu antiti es in figure Hq, for 
a folding mechanism with intermediate barrier height (a S i M = 1-0, a THRY = 0.5 in eq. (5.24)), and for energies 
which induce a uniform folding mechanism in fig. ll6|C. Plot ll6|A shows the correlation between theory 



and simulation of contact probability at the barrier peak Qi{Q^) given in the theory by eq. (4.35). Contacts 
with the same loop length have the same Qi value in the theory but not in the simulation (due to entropic 
differences even for contacts having the same loop length) . However the average of the simulational contact 
probabilities (squares in fig. |l6|A) correlates well with the theory (correlation coefficient = 0.93), indicating 
that the theory captures the average trend, but there are potentially important many-body effects in the 
calculation of the entropy loss upon contact formation. Figure [lqB shows the trend in contact probability 
with increasing loop len gth, b y replottin g the theor y and data against log loop length. The solid line is the 
theoretical result of eq. ( 4.35J ) with eq.s ( 4.19 ) and ( 4.20 ). Again the data fluctuate significantly about this 
curve, but the theory captures the trend in the average values (squares) (as before, correlation coefficient 0.93). 
Figure |16|C shows the results of tuning the energies to induce uniform folding, for theory and simulation. The 
solid line is the result of equation ( |5.13 ): Sei = T F Ssi. The data points with error bars are the values extracted 
from lattice simulations. Again the theory captures the overall trend (correlation coefficient 0.71), but there 
are still significant fluctuations about the average. 



5.E. Measures of routing 



Since the free energy barrier is maximized for a uniform funnel folding mechanism (eq. 5.17), we expect 
the barrier height to be decreasing function of the dispersion in Qi values at the barrier peak SQ 2 (Q^) = 
({Qi — Q^) 2 )- Let us introduce a measure of "routing" TZ(Q^) through the bottleneck by the function 



K(Q) = 



(SQ 2 ) (SQ 2 ) 



(SQ 2 ) 



g(i-o)' 



(5.37) 



The denominator is the most route-like the system can get at Q, i.e. if MQ contacts were made with 
probability 1 and M - MQ contacts were made with probability 0, then ((Qi - Q) 2 ) = (1/M)(MQ(1 - Q) 2 + 
(M — MQ)Q 2 ) = Q(l — Q). Thus 7Z(Q) is between and 1. 1Z(Q) is proportional to the lowest order 
correction to the route entropy (4.11) when fluctuations SQ are present: 
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(5.38) 



In the (non-perturbative) limit 1Z (Q) — 1, 5 RO ute = and only one route to the native state is allowed, i.e. 
since all Qi are only zero or one at any degree of nativeness, each successive bond added at that degree of 
nativeness m ust al ways be the same one. 

Using eq. ( 5.32 ) we can relate the fluctuations in optimal energies Sei in terms of fluctuations fro m the 
uniform cont act probabilities SQi as Sei — —XT 8Qi/Q^(l — Q^), and then substitute this along with (5.17) 
into eq. (|5.3| ) to obtain the decrease in barrier height with route measure: 
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SAFts - M ^Q^m = - M ^ n{Qt) - (5 - 39) 



As noted above (see e.g. comments after eq. |5.23 ), the reduction in the barrier height due to ordering 



heterogeneity scales extensively with system size. A dispersion in contact participations SQ 2 = 0.05 which 
is about 20% of the maximal dispersion (ss 1/4, taking Q$ w 1/2) lowers the barrier by about 0.1Nk B T or 
about 5k B T for a chain of length N « 50, believed to model a protein with w lOOaa p7fl . We should note 
here that renormalizing real amino acids into coarse-grained monomers may underestimate the heterogeneity 
effect, because small-scale free energy fluctuations do not average out upon coarse-graining, but will still add 
up extensively. Plots of the route measure as a function of Q for the various possible folding scenarios were 
given in reference E3]; essentially the same information is contained in the route entropy plotted here in 
figure [l3|. 

Figure [Tfi shows the barrier height at the transition temperature T F in units of e, vs. the route measure at 
the barrier peak TZ(Q^) (T F is itself weakly dependent on 1Z(Q$); see e.g. fig. |j). There is a monotonically 
decreasing trend for both theory and simulation. The solid line in figure p~7] is the theoretical result for a 
model with parameters in table B. The barrier vanishes for roughly the same magnitude of route measure 
in both theory and simulation, even though different dispersions in contact energies are required to produce 
these route measures. We can examine the effects of heterogeneity on a uniform, idealized funnel, where all 
loops have length £i = £ « 9, the average of the lattice model native structure of figure [I] and initially all 
energies a — e (long dashed line in fig. [IT]). Here too breaking the ordering sym metry Q — > {Qj\ by random 



perturbations on e, lowers the thermodynamic barrier, as explained in sections 2 2. A and p 3.E| ; the barrier 



goes down because the energy decreases more than the route-entro py de creases. Perturbing the structure b 



including entropic dispersion via S£i yields a simi lar re sult (c.f. eq. ( 5.23 )). The short dashed line in figure 
is the perturbation result for this model from eq. (|5.39), which agrees well with the full non-perturbative result 



for small 1Z. 



5.F. Kinetics of folding times 

We can verify that the heterogeneity effect on the barrier translates directly to an increase in folding rate by 
simulating long Monte Carlo runs and measuring the distribution of folding times at the transition temperature 
T F for various energy functions {ej} (see fig. |l8| ). 

To estimate the increase in rate assuming the prefactor is not important, take the barrier heights from fig. u7\ 
and the neglect the change in transition temperature T F since it only depends weakly on 1Z (see fig. |l2| ). Then 
the ratio of folding times, say at 1Z — and 1Z = 0.2, is 



t f (TZ = Q) _ exp (AF* (U = 0) /T F ) exp (2.1/0.7) 



t f (1Z = 0.2) exp (AFt (K = 0.2) /T p ) exp (1.2/0.7) 



3.6 (5.40) 



which is in reasonable agreement with the measured ratio of about 3 (see fig. Glf) . For the heterogeneous Go 
models considered here, the folding time is well approximated by an Arrhenius law with constant prefactor, 
indicating that the time scale in the reconfigurational kinetics is not strongly influenced by the degree of 
routing, at least up until the barrier vanishes. 



We can expand equation (1.3) to first order in changes in the barrier height SF^, folding temperature 8T F , 
and prefactor Sk to obtain a condition that must be satisfied for the rate to increase under perturbations of 
native energy: 

SF* „ 8k n Fi5T F 



1 p fvo •* F 

where we have made the approximation in the second inequality that the prefactor is most strongly affected 
by changes in the folding temperature, since the closer T F is to T G the slower reconfigurational dynamics is 
within the protein, and we have seen in fig. |l8| that the prefactor is nearly independent of the native energy 
distribution; then dk rs (dk /dT)8T. For well-designed folders, the folding transition temperature T F is abov e 



the dynamic glass temperature of the system, and reconfigurational dynamics is largely unactivated [117, 118], 



i.e. escape from individual traps does not dominate the kinetics. In this regime, the prefactor, which is related 
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to t he ho pping time, 
eq. (5.41) becomes 



is nearly temperature independent: d\nk /dT <C F^/T 2 . Then the condition given in 



SF^ < ST F 



Ft 



T F 



(5.42) 



For slower folders, effects on the prefactor may be as important as effects on the barrier, however for the 
well-designed folders considered here the rate increases so long as the relative barrier height decreases more 
strongly than the relative change in folding temperature. This is seen to be the case in figure n9l 



5.G. Mean loop length dependence of the barrier height 



Experimental evidence has shown a strong correlation of folding rate with a quantity in our model equal to 
the mean loop length divided by the total chain length |3(j]. Since no strong correlation with N is observed 
at least for typical protein sizes, we are interested in testing if the barrier height in our model correlates with 
I, at fixed N. 

We seek the change in free energy SF upon a change in the quantity (1/M) J^i £%• This can be found by 
utilizing the directional derivative (see Appendix A and eq. A. 5): 
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Using again the analog of eq. (5.6) that we used already to obtain eq. (5.21), the total derivative of F with 
respect to £j is equiv alent to the partial derivative. The free energy F depends on l^ explicitly only through 
the bond e ntrop y eq. ( 4.26 ), which is composed of a mean-field term depending on the sum £ plus a fluctuation 
term, eq.s ( |4.27|) and ( [4.281 ). Noting that 
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The first term in expression (5.44) is always positive for Q > 0. The second term weights loops with smaller 
li more heavily, and for these loops 8Q > 0, so the second term is always positive when entropic effects 
are considered alone. The native energies would have to be specially tuned to change the sign of this term. 
Moreover the whole expression is zero when Q = 0, so we conclude that the effect of increasing the mean 
loop length is to increase the barr ier h eight AF* . This effect is illustrated in fig. g^ for the simple case where 
£i = J, wh ere th e second term in ( 5.44 ) is zero; this is a lower limit to the actual increase in barrier. 

As eq. ( [5.44 ) implies, the change in barrier height with mean loop length is an entropic effect; proteins 
with native structures having larger mean loop length have lower entropy near the transition state. Another 
perhaps simpler way to see this is to note that the entropy of loo p clos ure m ust become larger (more negative) 
as the loop length for that contact is increased. From eq.s ( 4.19| ) and ( 4.20 ) and setting £i=1 for purposes of 
illustration, 
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Therefore more entropy is lost in contact formation for structures with larger mean loop length. Furthermore 
since 
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(5.46) 



this effect is largest at low degrees of nativeness (e.g. from eq. ( J5.45 ) at Q = 0, dsijdl ~ —1/1 while at Q = 1 
dsi/ d£ R3 0): the entropy becomes more of a convex down function as £ is increased, see fig. Ell Since the free 
energy barrier arises from the incomplete cancellation of entropy and energy (which is independent of £) as Q 
increases, a more convex down entropy indicates a larger barrier height. 
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5.H. Dependence of barrier heights and rates on structural variance 



By eq. ( |5.23 ), if we let ti = e and fix £, the folding barrier is lower for structures with larger variance in loop 
energies 5£ 2 . For proteins sufficiently well-des igne d that the folding rate k F near the transition temperature is 
governed by the free energy barrier as in eq. Jl.3| ) , then 
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(5.47) 



where we have also neglected changes in the folding transition temperature, since accounting for this is a 
higher order effect. We have also let A* sa 1 — Q$ since a in eq. (4.12) is approximately one (see table I). 



Most importantly the perturbation result neglects changes in the unfolded free energy on structural variance, 
as well as changes in the amount of native structure in the unfolded state. These reduce the trend on the rate 
due to structural variance. In general we should use 
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k F (S£ 2 ) „ AF*(0) AFt(5£ 2 ) 
fc p (0) ~ T F (0) T F (W) 



(5.48) 



for the log ratio of rates. The barrier height is then obtained from eq.s ( 4.30| ) and (4.35). It is seen in 
figure [22|that there is a significant increase in folding rate for structures having larger variance in loop lengths. 
Structural variance is generated here for a system with parameters in table I, but the loop lengths are given 
by 



ti = £ + a (£° - £) 



(5. 



where £° is taken from the loop length distribution in table I. As a varies from zero to one, the mean loop 
length J remains unchanged {£ = 9.14), but the structural variance S£ 2 increases (see fig. (|2l|)). 

5.1. Tuning energy functions to match desired potentials 

We have so far focused on how folding thermodynamics and mechanism are affected by properties of the 
native structure and distribution of native stability. We can also turn the problem around and seek the 
native structural properties or stability distribution that would give a specified free energy profile or folding 
mechanism. To ill ustra te, fig. |2J| shows the free energy potential F(Q) for the 27-mer chain. A fit to the 
lattice data of ref. [ 113 | for example can be made by annealing the stability distribution {e^}, contact length 
distribution {£{\, and/or other parameters with a cost function 



Cost= / dQ (F target (Q)-F(T,e,{ei},£,{ei},s ,b,a\Q)) 2 . 

Jo 



(5.50) 



Fits may be made to a fairly arbitrary potential by adjusting the native coupling energies or loop length 
distribution (the structure), e.g. potentials with and without intermediates, potentials having sharply peaked 
or flat barriers, etc. See fig. |23|B for an example of annealing the energies to fit a target potential with an 
on-pathway intermediate. Using a sufficiently accurate numerical theory of entropy loss, this method should be 
able to distinguish between intermediates governed by native structural properties, native energetic stability, 
or misfolding. 



6. SUMMARY AND CONCLUSIONS 



In this paper we have introduced refinement and insight into the funnel picture by considering heterogeneity 
in the folding of well-designed proteins. We have explored in minimally frustrated sequences how folding is 
effected by heterogeneity in native contact energies, as well as the entropic heterogeneity inherent in folding 
to a specific three-dimensional native structure. The general method we utilized here should be amenable to 
systematic refinement, and should be sufficiently accurate to compare with experimental results. 

Specifically we found that heterogeneity, whether energetic or entropic in origin, will always lower the fold- 
ing free energy barrier; see sections 2 2. A, 3 3.B, and 5 5. A, equations (2.12), (3.5), fl3.16| ), (5.18), ( 5.23| ), 
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and ( p.39|), and figu res |ll| , [12L and [L7j. This was sho wn using a rguments from the random energy model 
(sect ion [33.B3.BJ ), transition state theory (section |3 3.B3.bT2 | ), the optimum fluctuation method (sec- 
tion 3 3.B3.B.3D , th ermod ynamic perturbation theory (section |3 3.B 3.BT ), and free energy functional theory 
(sections 2 2. A a nd | 5 5. A ). For su fficiently well- designed proteins the corresponding rate also increases; see 
equations ( |3.6| ), ( 3.11| ), and ( 5.19 ), section 55. F , and figures [l8|, [19| and g2|. The effects of heterogeneity on 
barrier s an d rates are stron ger th an the effects on transition temperature; see the discussion in the introduction 
on eq. (1.1), and equation ( 5.42 ), and figures [l2| and [ljj We inves tigate d the effects on the fo lding barri er du e 
to correlations between energetics and topology, see sections 33. A, and 55. A, and eq.s (|3~^), ( 5.13J ) and ( 5.23| ), 



and found that for well-designed proteins the rate may be increased by making initially likely contacts stronger 
while making unlikely contacts weaker. Thus overall stability is conserved, but the energetic distribution is 
coupled to the native structure. 

For the ensemble of well-designed sequences having a given overall stability, homogeneously ordering se- 
quences have the largest folding free energy barrier. For most structures, where topological factors play an 
important role, this regime is achieved by introducing a large dispersion in the distribution of native contact 
energies which in practice would be almost impossible to achieve, see figs O and 16 C. As we reduce the disper- 
sion in the contact energy distribution and the energies approach a uniform value e, the dispersion of contact 
participations increases and thus the number of folding routes decreases, the f ree e nergy barri er decreases and 
the total configurational entropy increases to a maximum, see sections 3 3.C and 5 5.C5.C.2. Again, folding 
temperature is only mildly effected; the prefactor appearing in t he r ate is probably only mildly effected also, 
since it is largely a function of T F /T G and polymer properties |113|. If many-body forces are not too large 
the barrier may be reduced to zero, either by adding random native heterogeneity as in section 3 3.B 3.BTJ or 
by correlating native energy to native structure so that more probable contacts are stronger, as in section 0. 
The funnel picture, with different structural details, is valid for the above wide range of native contact energy 
distributions. However, tuning the energies further so that probable contacts have even lower energy (or 
allowing native energies to have a very large variance) eventually induces the system to take a single or very 
few folding routes at the transition temperature. A large dispersion of energies is required to achieve this, and 
in this regime the folding temperature drops well below the glass temperature range, where folding rates are 
extremely sl ow. 



In section 44.B we derived approximate expressions for the conformational entropy functional for a well- 
designed protein, see eq.s (4.11), (4.26), and ( |4.27| ). In section 44.B4.B.1 and Appendix pi we generalized the 
entropy of native core placement (the mixing entropy) used previously in models of folding p6|,|97|] to account 
for the effects of chain connectivity; for a hi ghly constra ined chain, many contact patterns are degenerate to 
essentially the same conformation. In se ction 14.B4.B.2 we derived a general condition for the conformational 
entropy to be a state function, viz. eq. (4.15). A Hartree approximation was taken to account for the entropy 
loss of loop closure in the presence of other contacts already formed. This agrees well with the average behavior 
on the lattice (see fig. ^6|), however there are important fluctuations away from t he me an, indicating many-body 
correlation effects are present which the theory doesn't account for. Equation ( 4.26| ) gives the conformational 
entropy loss given a distribution of native contact lengths {£i\- When each £i — > £, the expression reduces to 
eq. (|4.27 |) which is the entropy loss f or a finite system with mean return length £ for all contacts. When £ 



( 4.27 ) f urth er reduces to eq. ( 4.29 ) which is the entropy loss for a polymer system in the Flory mean- field 
theory |l00| . 

Residues in proximity are assumed to be in contact energetically; the reduction in conformational entropy 
at low Q to the elimination of conformations which happen to have residues in proximity |9q | is not included 
here because it is a smaller effect than the other contributions to the entropy. 

Several experiments support results from o ur t heory. Enhancement of folding rates by weighting entrop icall y 
likely contacts, as found in sections 33. A and 55. B, has been observed in Escherichia coli Che Y [119]. 
Depending on the variance of native interactions and how native interaction strength correlates with the 
entropic likelihood of contact formation, sequences may be designed to fold both faster or slower to the same 
structure as a wild-type sequence. Enhancements or suppressions of folding rate to a given structure due to 
changes in sequence are modeled in our theory through changes in native interactions, which induce significant 
changes in the rate-governing free energy landscape of a well-designed protein. A minimally frustrated sequence 
may fold to a given native structure by a variety of folding mechanisms (see fig.s O and 111), including through 
both on and off-pathway intermediates (see fig. k3). Thus for example folding in Im7 and Im9 may likely initiate 



from different places within the native structure depending on the distribution of stability |16|. Folding in the 



IgG binding domain of protein L may tend to initiate from a specific region of higher stability, indiscernible 
from the apparently symmetric native structure [120]; contact f orma tion probability at the transition state 
depends on both energy and entropy, as expressed in equation (4.35). However, for a large range of native 
energy distributions, barrier heights, and corresponding rates, a funnel folding mechanism is preserved, in 
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that there are many routes to the native structure, see fig.s |7] and 13. Folding rates in m utant protein s that 
exceed those of the wild type have been receiving much interest in recent experiments p7, 119] , 121-123]; here 
we saw how these effects can be understood by applying general principles of the energy landscape. Folding- 
rates in the theory were seen to increase with the variance in contact formation probability, a thermodynamic 
quantity closely related to the dispersion in experimental <f> values (see fig . |l7| ) . The general trend of reduced 
rate with larger contact order BO] is well captured by the theory (see fig. |20[ ). Additionally, for fixed contact 
order, folding rate was shown to increase with larger variance in the contact lengths which constitute the 
native structure (see fig. p2| ). 

Fluctuations in rate due to the effect of sequence perturbations on weakening or strengthening specific 
non-native kinetic traps or generally changing non-native interaction strength is an interesting topic of future 
research. 

It is important to note that the enhancements or reductions in rate we have explored are mild compared 
to the enhancement by minimal frustration (funneling the landscape): the fine tuning of rates may be a 
phenomenon manifested by in vitro or in machina evolution, rather than in vivo evolution. Nevertheless 
folding heterogeneity may become an important factor for larger proteins, where e.g. stabilizing partially 
native intermediates may increase the overall rate or prevent aggregation. Adjusting the backbone rigidity or 
the non-additivity of interactions pUpq] can also modify the barrier height, possibly as much as the effects we 
are considering here. There may also be functional reasons for non-uniform folding - malleability or rigidity 
requirements of the active site may inhibit or enhance its tendency to order. 

The notion expounded here that rates increase with heterogeneity at little expense to native stability 
contrasts with the view that non-uniform folding in real proteins exists merely as a residual signature of 
incomplete evolution to a uniformly folding protein. Moreover, the phenomenon that fluctuations in native 
contact energies contribute extensively to the free energy landscape indicates that the prediction of numerical 
values for folding rates and mechanisms from approximate energy functions may be even more difficult than 
originally suspected, i.e. even if systematic error in the calculation of potentials is eliminated, 0(Af) corrections 
may still remain. 

The amount of route narrowness in folding was introduced as a thermodynamic measure through the mean 
square fluctuations in a local order parameter. The route measure may be useful in quantifying the natural 
kinetic accessibility of various structures. While structural heterogeneity is essentially always present, the 
flexibility inherent in the number of letters of the sequence code limits the amount of native en erget ic hetero- 
geneity possible. However some sequence flexibility is in fact required for funnel topographies [124] and so is 
probably present at least to a limited degree. 

We have seen here how a very general theoretical framework can be introduced to explain and understand 
the effects of heterogeneity in native stability and structural topology on such quantities as folding rates, 
transition temperatures, and the degree of routing in the funnel folding mechanism. Such a theory should be 
a useful guide in interpreting and predicting future experimental results on many fast-folding proteins. 
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A. APPENDIX A 



Consider the free energy of eq. fl4.8| ) as the integral over a semi-local free energy density F({Qi}) = 
J2i fi(Qi> Q) = J2i fi(Qi> 2 1 Qj)- Taking the differential of a new thermodynamic function G = F + fi ^\ Qi 



5G = J2 
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(A.l) 



and demanding that dG/dQj — for all j Legendre transforms to a new variable /i, with dG/d/i = MQ. This 
is equivalent to minimizing the free energy subject to the constraint of fixed Q. The equation dG/dQj = 
means that 
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for all i, which enforces equation ( 4.35 ) for each Qi. The Lagrange multiplier /i is interpreted as the force 
corresponding to the potential F(Q): 
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by the following arguments. From eq. ( |A.1[ ) (dG/dQi) = is equivalent to dF/dQi + jj, = or 
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(A.3) 



(A.4) 



therefore /i = —(dQ / dQi)(dF / dQ) = — (l/M)(dF/dQ) . Sinc e th e changes w.r.t. to the local order parameter 
of all the local free energy terms are the same number fj, (e q. |A.2j ), this number equals the change in the sum 
(F) w.r.t. the change in the sum of the Qi (MQ) (eq. |A.3[). 
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Another way to see eq. (A.3) directly is to consider d/d(MQ) — d/d(^2 i=1 Qi) as the directional derivative 
D U F of F in an M-dimensional space along the direction VQ = ^Zi * with i a unit vector along the ith axis, 
defined by the ith contact. So 
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For the two-state potentials considered here fj, has two roots which give the positions of the barrier peak 
and equilibrium unfolded state (where the local force is zero, see inset of fig. (p3)). 



B. APPENDIX B 



In this appendix we compare a si mple micros copic model of route entropy <S ROU te {{Qi}) with the semi- 
empirical model introduced in section 44.B4.B.1 . In the absence of the constraints due to chain connectivity 
the route entropy is given by eq.s (4.9) and ( 4.1C ). Each native contact pattern {Qi} can be expressed on 
a native contact map, (see e.g. ref. |125|). The idea here is that since several contact patterns correspond 
to the same constrained state because of chain-connectivity, the contact-map can be coarse-grained into cells 
inside of which one or more contacts fully constrains the cell. We make a mean-field approximation and group 
configurations into cells all of the same Q-depend ent size, lo q > 1, with u>o = 1. To clarify, imagine a system 



with 6 possible native contacts, and 2 made. Eq. ( |4.9|) gives 15 possible native cores, but if OJ1/3 = 2 there are 



3!/l!(3 — 1)! + 3!/2!(3 — 2)! = 6 configurations; the first 3 configurations correspond to 2 contacts together in 
any one of the 3 renormalized cells, and the second 3 correspond to one contact in two separate cells. 
In general if there are MQ contacts made of M total, with a renormalization factor cj q , there are 
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total configurations. Here j MAX — int [min (MQ, M/uj q ) — (MQ/uj q + 1/2)]. For example if M = 8, MQ = 5, 
and W5/ g = 2, there are 5 total configurations instead of 56 assu ming binary mixing. In the thermodynamic 
limit the entropy is obtained from the largest term in the sum of (B.l): 



fL 



m 



max 
i 

< 3 < JMAX 



(M/U3 Q )\ 



[int (MQ/uo Q + j + 1/2)]! [M/u Q - int (MQ/u> Q + 3 + 1/2)]! ' 



so that taking dhiVt/dj — gives 
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j* (Q) = — (1/2 - Q) 

Wq 

where < j* (Q) < min (MQ, M/u> Q ) - MQ/u Q 



Letting J"* (Q) = Mj* (Q) and applying the boundary condition eq. ( |B.4| ) gives 
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where Q* solves Q* = 1/2wq* . The route entropy is then 
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In the limit of the cell size uj q —* 1, J'* (Q) — » and eq. (B.6) reduces to the binary fluid mixing entropy 
eq.s (4.9) and ( 4.10J ), but for lj q > 1 the mixing entropy is reduced. The dashed line in figure M shows a best 
fit to the lattice data for oj q of the form exp(aQ /3 ). This gives a = 2.08 and /3 = 1.85, whose cell size u>, 
shown in fig. E4 



is 
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TABLE I. PARAMETERS USED IN THE ANALYTIC MODEL TO COMPARE WITH SIMULATION 



Chain 


Average 


Average 


Average 


Configurational 


R.M.S. 


Reduction 


length 


bonds 


(homopolymer) 


extra 


entropy 


energy per 


parameter 




per residue 


attraction 


native 


per residue 


non-native 


in route 






per contact 


stability 
per contact 




contact 
(roughness) 


entropy 


(N) 


(*) 


(E/zN) 


(e) 


(So) 


(b) 


(a) 


27 


28/27 


-0.0 


-3.0 


1.352 


0.0 


0.9 





loop length 
distribution 



{3, 3, 3, 3, 3, 5, 5, 5, 5, 7, 7, 7, 7, 7, 7, 9, 9, 9, 9, 9, 9, 11, 11, 13, 21, 23, 23, 23} 



energies inducing 
uniform folding 



{4.9, 4.5, 2.1, 2.6, 2.7, 5.0, 4.1, 2.3, 6.0, 1.0, 2.1, 1.6, 2.5, 4.0, 
1.7, 3.0, 5.0, 3.7, 2.0, 2.4, 1.1, 2.1, 3.3, 3.1, 1.0, 5.5, 2.9, 2.0} 
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2. A. Figure Captions 

FIG. |l] Schematic, lattice, and off-lattice representations of the native structure, characterized through 
the distribution of contact energies {e^} and contact entropies {s^}, (defined through the distribution of loop 
lengths ££i})- The probability to form contact i having energy e, and loop length ii is Qi. 

FIG. H| The fraction of time a contact is made vs. Q, Q*(Q) for folding to the lattice structure shown in 
fig. 1. Dashe d black curves are the result of the functional theory of section 0, using an energy function given 
by eq. ( |5.24|) w ith a = 0.5. Thin solid curves are Monte Carlo simulation results for folding to this structure, 



using eq. (5.24) with a = 1. Some contacts are formed relatively early while others remain unformed until the 
protein is largely native; the magnitude of this dispersion in contact formation is captured fairly well by the 
theory. 

FIG. g Making contact 1 stronger in the a-spectrin SH3 domain should speed folding more than making 
contact 2 stronger by the same amount, since c ontact 1 in the distal loop is more likely to be formed in the 



transition state than contact 2 in the RT loop. [126, 127|. 

FIG. The reduction in effective thermodynamic barrier due to the presence of native heterogeneity. The 
probability of a nucleation event dies off exponentially with barrier height k ~ exp(— F/T), and the probability 

distribution P(F) of barrier heights is centered on F and has some width in the presence of energetic and 
cntropic inhomogeneity. The effective nucleation rate k(F) ~ P(F)k(F) has a maximum at an effective barrier 

F* (T) < F corresponding to the optimum fluctuation. Figure adapted from ref . [[78| . 

FIG. g| Free energy vs. fraction of native contacts Q, obtained from off-lattice simulations using a uniform 
Go potential to the native structure of CheY (data from 139]). Fluctuations in the free energy profile here 
depend solely on how the native topology affects the entropy at partial degrees of nativeness. The profile 
observed is the optimal fluctuation from uniform ordering, given the native structure under study. 

FIG. |sj Illustration of partially native configurations consisting of native cores and the surrounding polymer 
halo. The core may be globular or ramified. 

FIG. Route entropy for the 27-mer with native structure shown in fig. H. The dotted curve is the putative 
binary fluid mixing entropy in the absence of the backbone. The solid curves includes a prefactor (1 — Q a ) 
corresponding to the entropic asymmetry of applying contacts to an unconstrained polymer and removing 
contacts from a fully constrained polymer, described in the text. The data are for the 27-mer lattice model 
shown in fig. ffl ]128| ], and are obtained for low Q values by making subsets of M(l — Q) contacts repulsive, 
MQ contacts attractive, and then finding the most native-like state in a low temperature quench. For high Q 
values they are obtained by making random sets of M{\ — Q) contacts repulsive, and counting the remaining 
native cores which are distinct. This method finds the reduction in binary fluid mixing entropy due to chain 
connectivity and particular native topology of the protein under study. We assume that a is essentially 
constant for a given native structure, independent of the distribution of native energies. The computation 
treats all contact formation probabilities on an equal footing (all Qi = Q), and so the ro ute en tropy plotted 



is an upper limit to the actual route entropy present. See also fig. 13. Using eq.s (4.11), (4.12) with Qi = Q 
gives a = 1.37 for the best fit to the lattice structure in fig. [j], while somewhat smaller values of a fit the free 
energy function best (table | ). The dashed curve is the best fit of a simple microscopic theory to the route 
entropy, described in Appendix B. 

FIG. ra Illustrating constraints on the functional form of the entropy, given it must be a state function. 
Path (1) dashed, Path (2) solid. 

FIG. |9] The total energy of the 27-mer lattice model, in units of e, as a function of the fraction of native 
contacts Q, for native energies {e*} tuned to give a homogeneous transition state (thin solid curve), for uniform 
native energies {e} (thick solid line), for native energies {e°} which induce enough heterogeneity to eliminate 
the barrier (long dashed), and for native energies {ef } which induce a folding mechanism by only a few routes 
(short dashed). 

FIG. [lC] Total entropy vs. Q for the 27-mer lattice model. The maximal solid curve is for {e;} = {e}. For 
these energies the entropy has its highest value. The entropy for the homogeneous funnel with tuned energies 
{e*} (thin solid curve) is significantly lower. For native energies {e°} which eliminate the barrier (long dashed), 
the entropy is essentially unchanged, since changes in entropy near the extremum (where {e^} = {e}) are second 
order. The entropy for a route-like folding mechanism (short dashed) is lower than the other curves since the 



route entropy (eq. 4.11) no longer contributes significantly to the total entropy at temperatures where the 
native state is stable. 



FIG. 11 Free energy vs. Q in units of e at the folding temperature T F ({ei}) for the 27-mer lattice model 



(TOP) and theory (BOTTOM), for the energy functions described in the text and previous 2 figures. 
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FIG. 12 Barrier height in units of native interaction strength as a function of RMS native energetic 
variance in the same units, for simulations of the Go lattice model. The barrier height is multi-valued here 
because of energetic correlations with loop lengths, as described in the text. Also shown is the folding 
transition temperature, which remains fairly constant until the barrier disappears and the RMS energetic 
variance becomes large. Another measure of how well-designed the protein is weights T F by the root entropy 
per monomer at the barrier peak, which also remains fairly constant until the barrier disappears (thin solid 
line with squares). 

FIG. O Route en tropy for the 27-mer lattice model (Top) and free energy functional theory (Bottom), 
as given by equation (4.11). The route entropy is a thermodynamic measure of how labile or itinerant the 
native core of the protein is at intermediate stages of folding. Different line styles represent different folding 
mechanisms described in the text (the same code is used as in fig.s 0, Ru, and O). 

FIG. p] Contact formation probability Qi(Q) as a function of Q. (TOP) Simulation results, (BOTTOM) 
Analytica theory. (LEFT) Coupling energies are tuned to induce a route-like folding mechanism, (RIGHT) 
coupling energies are tuned to induce a uniform funnel folding mechanism through the barrier peak. 

FIG. Qq Flory entropy of the system, defined by subtracting the route entropy from the total entropy, 
as a function of the fraction of native contacts. The top plot is the result from simulating the lattice 27-mer. 
The bottom plot is the result of the analytical theory (the same line styles as fig.s 0, pi, O, and O are used). 

FIG. |l6| Comparison of the theory and simulations, for a set of energies inducing an intermediate folding 
barrier height in (A),(B) (a SIM = 1.0, a THRY = 0.5 in cq. ( [3.24 )), and for energies e* which induce a uniform 
folding mechanism (C). (A) Contact probabilities at the barrier peak. (B) Contact probability vs. log loop 
length. Cr osses are t he la ttice data, squares are their average, and the solid line is the theory result (eq. 4.35 
with eq.s (4.19) and (4.20)). (C) plots the energy fluctuation required to tune the system to fold uniformly vs 
the entropy fluctuation each loop has. The solid line is the theoretical result eq. (5.13) and the data points 
with error bars are extracted from the simulations. 

FIG. [IT] Free energy barrier at the transition temperature T F in units of e, vs. the route measure at the 
barrier peak 1Z(Q$). (Solid line) theoretical result for a model with parameters in table Q. (Lon g dashed) 
idealized funnel with ti = 2, and e, = 1 initially. (Short dashed) perturbation result from eq. ( 5.39 ). 

FIG. [lq Simulating a long Monte Carlo run for the 27-mer lattice model at T F yields an exponential 
distribution of first passage times to the folded state from the unfolded state(inset B, plotted for the point 
with 1Z — 0.11), indicating a single time-scale governs the kinetics. The mean of the distribution (the folding 
time r F ) is plotted here for a few energy functions which result in varying amounts of routing through the 
barrier by eq. ( J5.37 ). For comparison, the folding time determined from r = const, x exp(— AF$ /T F ) is shown 
as a solid line, with AF$(1Z) taken from the lattice data in fig. [17] (there are error bars about the points 
defining this line which are not shown). The folding time is a decreasing function of the route measure TZ(Q^). 
(Inset A) The log of the folding time varies nearly linearly with the barrier height indicating that the prefactor 
is roughly constant while the native energy distributions are varied. 

FIG. [19] The relative barrier height decreases more strongly than the relative folding temperature, as a 
function of route measure at the barrier peak (eq. |5.37|) . Thus the condition for the folding rate to increase, 
eq. fl5.42|), is satisfied. 

FIG. |20| Dependence of the free energy profile F( Q) at T F on the mean loop length 2, for the analytic 
model with length N — 50, £i = 2, and a = e (eq.s ( |4.30| ) and (4.31)). 2 labels each curve. The barrier 
undergoes an increase that is stronger initially. The inset plots the barrier height as a function of 2, in units 
of e. The trend in barrier height with £ shown here is a lower limit to the full theoretical dependence given in 
eq. dJTiil) 

FIG. |21| Entropy vs. Q for e ; = e and £i — £, for various t. The conta ct ord er = £/N (N = 27 here) labels 
each curve. Results are for the theory with parameters in table I (c.f eq. 4.27). As £ increases, more entropy 
is lost initially, leading to a larger free energy barrier and correspondingly slower folding rate. (INSET): The 
model shows a weak increasing dependence of 9' m value with contact order, defined here as the relative degree 
of partial order at the barrier peak: 9' m = ( Q> — Q v )/(1 — Q v ). The trends seen here are again lower limits to 
the full dependence on 2 given in eq. (5.44), we illustrate just the mean-field term here. 

FIG. E3 Log of the ratio of rates give n by eq. (5.48) as a function of structural variance S£ 2 at fixe d 
2, obtained by following the recipe of eq. ( 5.49J). ( Dashe d) A pproximate perturbation result of eq. ( 5.47 ) 



(Solid) Full non-perturbative result using eq.s ( 4.30| ) and (4.35), which accounts for changes in the unfolded 
free energy with increasing variance. The barrier is calculated at T F , which changes only mildly with S£ 2 until 
the barrier height approaches zero at 5£ 2 /2 « 0.25. 

FIG. |2^ (A) Squares are the free energy potential obtained in ref. [113] to the str uctu re shown in 
fig. [yB. Solid curve is a fit to the data by annealing the native coupling energies using eq. ( [5.50] ). (INSET): 
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The effective force n(Q). The two zeros of fj, occur at the positions of the unfolded state Q° and barrier peak 
Q*. (B) Fitting to a potential containing an on-pathway intermediate by adjusting the coupling energies 
{^i}- The illustration here is for a system of length N — 57 with 142 contacts, having the native loop length 
distribution of the a-spectrin SH3 domain 139]. The unfolded and folde d stat es are at Q — 0, 1 since the bulk 



limit (£Q ^> 1) was taken in determining the potentials here (c.f. eq. 4.2E ). A specific part of the protein 
is biased energetically with successively larger strength to achieve an on-pathway intermediate. This could 
also have been achieved in the theory by adjusting the native structure to have contacts which are relatively 
likely entropically. Finding the set(s) of loop lengths {£*} that best fit a given target potential is relevant to 
the problem of which native structures would typically have such a potential, if native energetic features are 
not as important. Progression of the effective force fJ-(Q) is also shown in (C); /j.(Q) corresponding to the 
potential with the intermediate has 3 roots, one at the intermediate position, one at the barrier peak, and 
one at a smaller barrier peak near the unfolded state. (D) The barrier height here deviates from its typical 
dependence on the route-measure, since the route-measure becomes largest at the value of order parameter 
characterized by the intermediate, rather than at the barrier peak. 

FIG. p3 Renormalized mixing cell size lu q as a function of Q, assuming the form ui Q — exp (aQ@) with 
a = 2.08 and /3 = 1.85, which are the numbers giving the best fit to the lattice data (dashed curve in fig. 0). 
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